INTRODUCTION

This progress report consists of two volumes: the first focuses
on the theoretical foundations of our work in detecting and exploiting
parallelism; the second amounts to a preliminary user's manual for the
ILLIAC-IV FORTRAN system. Volume I also contains a description of the
prototype of the eventual Parallelism-Analyzer.

During this period, we have accomplished the following:


•  Refinement  and  extension  of  algorithms  for  detection  and 
exploitation  of  parallelism. 

•  Specification of the syntax and semantics of IVTRAN, the
extension of FORTRAN-IV designed for ILLIAC-IV.

•  Functional  specification  of  a  powerful  linking  loader. 

•  Development  of  a  prototype  parallel  analyzer. 

•  Near  completion  of  the  parser  portion  of  the  compiler. 

•  Investigation  of  certain  algorithms  required  to  support 
optimization. 


•  Investigation of algorithms for efficient allocation of arrays
within P.E. memory (the "knap-sack" problem).

•  Rationalization of the ILLIAC-IV instruction set as required
for code-selection.


This report concentrates on the extended FORTRAN (IVTRAN) and
the algorithms for deriving it from conventional FORTRAN. The next
semi-annual report will focus on optimization, data allocation, and
code selection algorithms.

In Volume I, Chapter I discusses several methods for the solution of
Laplace's equation on a parallel computer by relaxation methods.
Chapter II presents parallelism-detection methods for [...] processors,
while Chapter III restricts these methods to [...]. Chapter IV then
considers [...]. Chapter VI describes a prototype parallel analyzer.

Volume II contains three chapters. The first is [...] ILLIAC-IV FORTRAN
entitled "Control Structures in ILLIAC-IV FORTRAN". Chapter II is a
preliminary version of a language manual for IVTRAN; it is essentially
complete with the exception of identifiable [...] which will be available
as the project progresses; e.g., [...] diagnostics is determined.
Chapter III presents the functional specification of the supporting
Linkage Editor.

In [...] language [...] link editor, we have surveyed systems [...] in
current use on large scale scientific systems such as IBM [...] and
CDC 6000/7000. Successful implementation of a useable FORTRAN system on
ILLIAC-IV means that run-time operations such as format [...],
computation of standard mathematical functions, and overlay operations
must be supported. We have identified the need for [...] support required
by a FORTRAN system and assumed its existence [...] will be suitable to
users of ILLIAC-IV FORTRAN. Obviously, we require complete specification
of those features in order to complete the code selector and the linkage
editor.

CHAPTER  I 


THE  SOLUTION  OF  LAPLACE'S  EQUATION  ON  A 
PARALLEL  COMPUTER  BY  RELAXATION  METHODS 




Relaxation 


The computer solution of Laplace's equation* on a two-dimensional
rectangular domain with prescribed boundary values involves finding a solution
of the following system of linear equations:

(1)  A(I, J) = .25 * [A(I - 1, J) + A(I + 1, J) + A(I, J - 1) + A(I, J + 1)],

for 1 ≤ I ≤ M, 1 ≤ J ≤ N, with prescribed "boundary" values for
A(0, J), A(M + 1, J), A(I, 0), A(I, N + 1).

Let D denote the set { (I, J) : 1 ≤ I ≤ M, 1 ≤ J ≤ N } of "interior
points". The ordinary relaxation method solves this system of equations by
the following process:


1.  Choose any initial values for A(I, J) [(I, J) ∈ D].

2.  Execute Equation (1) as a FORTRAN statement for each
(I, J) ∈ D, in some fixed order.**

3.  Repeat Step 2 until convergence is obtained - i.e., until
successively computed values of A(I, J) are sufficiently close,
for all (I, J) ∈ D.

We  will  call  Step  2  a  single  relaxation  step. 

*  For convenience, we only consider Laplace's equation. The generalization
to Poisson's equation is trivial.

** Actually, in the chaotic methods discussed later, the order randomly changes.

The usual algorithm for the relaxation step on a sequential computer
is to enclose statement (1) in the following DO loop:

(2)        DO I = 1, M
              DO J = 1, N .

This gives the Gauss-Seidel method.
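As an illustration only, one relaxation step in the order of loop (2) can be sketched in Python. The grid layout (a list of lists with the boundary stored in rows 0 and M+1 and columns 0 and N+1), the helper names `make_grid` and `gauss_seidel_step`, and the test values are assumptions made for this sketch, not part of the report.

```python
def make_grid(M, N, boundary=1.0, interior=0.0):
    """Build an (M+2) x (N+2) grid; the outer ring holds the boundary values."""
    A = [[boundary] * (N + 2) for _ in range(M + 2)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            A[i][j] = interior
    return A


def gauss_seidel_step(A, M, N):
    """One relaxation step: statement (1) executed in the order of loop (2)."""
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            # A[i-1][j] and A[i][j-1] were already updated during this step.
            A[i][j] = 0.25 * (A[i - 1][j] + A[i + 1][j] + A[i][j - 1] + A[i][j + 1])
    return A


# With every boundary value equal to 1, the solution is 1 at every interior point.
A = make_grid(3, 3)
for _ in range(200):
    gauss_seidel_step(A, 3, 3)
```

Because each point is overwritten in place, the values at (I-1, J) and (I, J-1) used for (I, J) are the ones computed earlier in the same step.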

The obvious ways of programming the relaxation step for an array
computer like the ILLIAC-IV are: (a) the Jacobi method, which uses a

           DO FOR ALL (I, J) ∈ D

loop, and (b) the row method using a

           DO I = 1, M
              DO FOR ALL J ∈ {1, 2, ..., N}

loop.

The theoretical rates of convergence for these methods are given in
Table 1.* These rates are inversely proportional to the expected number of
relaxation steps needed to obtain the solution. A small number of test cases
have been run, and they agree quite well with these values.

The rate of convergence depends on the number of new values - i.e.,
values computed during the current relaxation step - which are used in
computing A(I, J). They are:**

        Jacobi         -  none
        row            -  one  [A(I - 1, J)]
        Gauss-Seidel   -  two  [A(I - 1, J), A(I, J - 1)] .

The more new values used, the faster the convergence. Thus, the
Gauss-Seidel method converges in fewer relaxation steps than the parallel
methods. However, the parallel computation of each step more than
compensates for this. Hence, the Jacobi method is faster than the row method
which is faster than the Gauss-Seidel method (assuming a large enough number
of processors).

*  They are obtained from the spectral radius of the iteration matrix for the
particular method, as defined in [1]. The [...] Gauss-Seidel methods can be
found in [1].

** We are neglecting the points adjacent to the boundary.


                          Ordinary
                          Relaxation     Over-Relaxation

Jacobi Method                 1                 1

Row Method                   4/3                2

Gauss-Seidel Method           2          1.8 MN / sqrt(M^2 + N^2)


                              TABLE 1

                  Approximate Relative Theoretical
                        Rates of Convergence
                    (Compared to Jacobi Method)


Over-Relaxation 


In over-relaxation methods, statement (1) is generalized to

(3)  A(I, J) = ω * .25 * [A(I - 1, J) + A(I + 1, J) + A(I, J - 1) + A(I, J + 1)]
               + (1 - ω) * A(I, J) ,

where ω is a fixed parameter.* (If ω = 1, (3) reduces to (1).) This yields no
improvement in the Jacobi method, but can speed up the other two methods. The
optimal choice of ω depends upon M and N. For M, N ≫ 1, ω is slightly less
than [...] for the row method and slightly less than 2 for the Gauss-Seidel
method. The convergence rates given in Table 1 were obtained using this
optimal choice of ω for each method. For the row and Gauss-Seidel methods,
the faster convergence more than compensates for the extra computations in (3).
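The effect of the parameter in statement (3) can be tried out with a small Python sketch. The grid size, the boundary values, the tolerance, the value 1.5 for the parameter, and the function names `sor_step` and `steps_to_converge` are all assumptions chosen for illustration; the sketch uses the Gauss-Seidel ordering and does not attempt to find the optimal parameter.

```python
def sor_step(A, M, N, omega):
    """One over-relaxation step (statement (3)) in Gauss-Seidel order.

    Returns the largest change made to any interior point."""
    biggest = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            avg = 0.25 * (A[i - 1][j] + A[i + 1][j] + A[i][j - 1] + A[i][j + 1])
            new = omega * avg + (1.0 - omega) * A[i][j]
            biggest = max(biggest, abs(new - A[i][j]))
            A[i][j] = new
    return biggest


def steps_to_converge(M, N, omega, tol=1e-6, max_steps=100000):
    """Count relaxation steps until no interior point changes by tol or more."""
    A = [[1.0] * (N + 2) for _ in range(M + 2)]   # boundary values all 1
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            A[i][j] = 0.0                          # arbitrary initial values
    for step in range(1, max_steps + 1):
        if sor_step(A, M, N, omega) < tol:
            return step
    return max_steps


steps_plain = steps_to_converge(10, 10, 1.0)   # omega = 1 is ordinary relaxation
steps_over = steps_to_converge(10, 10, 1.5)    # an over-relaxed run
print(steps_plain, steps_over)
```

On this small grid the over-relaxed run needs noticeably fewer steps than the ordinary one, in line with the qualitative claim in the text.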

The enormous speed-up of the Gauss-Seidel method predicted by Table 1
does not occur in practice. The theoretical convergence rates are asymptotic
limits as the number of steps goes to infinity. They hold only for
computations requiring very great accuracy, and thus a very large number of
relaxation steps. The theoretical convergence rate for the Gauss-Seidel
over-relaxation method is probably valid only when the number of steps needed
in the ordinary method is ≫ MN. In practice, this is never the case.


In practice, the following "statements" appear in the relaxation loop:

        Δ = .25 * [A(I - 1, J) + A(I + 1, J) + A(I, J - 1) + A(I, J + 1)] - A(I, J)
        A(I, J) = ω * Δ + A(I, J) ,

and Δ is then used in the convergence test. Here, Δ is an array variable
(with subscripting) for the parallel methods, and is usually a scalar for the
Gauss-Seidel method.


Hyperplane Methods


The methods developed in [2] show that the Gauss-Seidel loop (2) for
both ordinary and over-relaxation is equivalent to the following loop:

(4)        DO K = 2, M + N
              DO CONC FOR ALL (I, J) ∈ { (x, y) ∈ D : x + y = K } ,

which yields the hyperplane method. The "CONC" indicates that the DO FOR ALL
can be executed by independent, asynchronous processors. The DO FOR ALL set

           { (x, y) ∈ D : x + y = K }

is shown in Figure 1 for various values of K.
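The claimed equivalence of loops (2) and (4) can be checked numerically on a small grid. The Python sketch below is an illustration under stated assumptions: concurrent execution over one hyperplane is mimicked by reading all operands before writing any result, and the grid size, initial values, and function names are invented for the example.

```python
import copy


def relax(A, i, j):
    """Right-hand side of statement (1) at the point (i, j)."""
    return 0.25 * (A[i - 1][j] + A[i + 1][j] + A[i][j - 1] + A[i][j + 1])


def gauss_seidel_step(A, M, N):
    """Loop (2): sequential sweep over I, then J."""
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            A[i][j] = relax(A, i, j)


def hyperplane_step(A, M, N):
    """Loop (4): sweep over K, treating all points with I + J = K together."""
    for k in range(2, M + N + 1):
        plane = [(i, k - i) for i in range(1, M + 1) if 1 <= k - i <= N]
        # Read every operand first, then write, to imitate concurrent execution.
        new_values = [relax(A, i, j) for (i, j) in plane]
        for (i, j), value in zip(plane, new_values):
            A[i][j] = value


M, N = 4, 6
start = [[float((3 * i + 5 * j) % 7) for j in range(N + 2)] for i in range(M + 2)]
a = copy.deepcopy(start)
b = copy.deepcopy(start)
gauss_seidel_step(a, M, N)
hyperplane_step(b, M, N)
same = (a == b)
print(same)
```

The two results agree because a point on the hyperplane I + J = K reads only points on the hyperplanes K - 1 (already updated in both orderings) and K + 1 (not yet updated in either).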


The hyperplane method naturally has the same rate of convergence as
the Gauss-Seidel method. Comparing rates of convergence and number of
computations per step, it is seen that the row method is still faster
than the hyperplane method for ordinary relaxation. For over-relaxation, the
hyperplane method is superior.

The hyperplane method can be pipelined as follows. While a DO FOR ALL
[...] is being executed for a relaxation step, the DO FOR ALL [...] can be
executed at the same time for the next relaxation step, [...]. [...] we can
rewrite the hyperplane method as the pipelined hyperplane method with the
following loop:


(5)        DO K = 0, μ - 1
              DO CONC FOR ALL (I, J) ∈ { (x, y) ∈ D : x + y ≡ K mod μ } ,

for any μ > 1.


[...] of the pipelined hyperplane method equals that of the Gauss-Seidel
method. Moreover, it has the same optimal value of ω for over-relaxation.

[Passage illegible in the source; it introduces the Z-stripe method and the
sets Z_K shown in Figure 2.]


The Z-stripe method requires the same number of sequential computations
per relaxation step as the row method, but converges faster - by a factor of 1.5
for ordinary relaxation, and by a much greater factor for over-relaxation.
Moreover, observe from Figure 2 that the sets Z_K always contain exactly one
element from each row. The Z-stripe method is therefore well suited to the
ILLIAC. The extra overhead in address calculation of the Z-stripe method
compared to the row method is small. The only disadvantage of the Z-stripe
method is that if N ≢ 0 mod 64, then there is wasted memory space in storing
the array A. This space can often be used for storing other variables. Unless
there is a critical shortage of memory space, the Z-stripe method should be
used for the ILLIAC.

The Z-stripe method is probably the best method for the ILLIAC when
NM ≫ 64, since


(i)  M processors are used, and

(ii)  Each  computation  of  A(I,  J)  uses  two  new  values.* 

Obviously,  (ii)  cannot  be  improved  upon.  As  for  (i) ,  any  attempt  to  use  more 
than  M  processors  probably  requires  either  a  wasteful  storage  allocation  scheme 
or  an  inefficient  routing  process.  The  most  likely  candidate  for  improving  upon 
the  Z-stripe  method  is  the  pipelined  hyperplane  method  with  u  *  M/2  when 
M  -  32.  The  storage  allocation  problem  for  this  method  has  not  been  studied. 


The Checkerboard Method


[...] the points of the array [...] colored red and black in a checkerboard
pattern. Then a single relaxation step consists of first computing A(I, J)
simultaneously for all black points (I, J), then computing A(I, J) for all
red points (I, J).

[...] number of sequential computations per relaxation step [...]. [...]
methods require the same number of computations for ordinary relaxation.
However, the checkerboard method is far superior with over-relaxation.
Moreover, it requires only half as many processors as the Jacobi method.
[...] over-relaxation method should be used on a computer with [...].

However, this is the [...] checkerboard method for asynchronous processors.
This is possible because of the DO CONC in (5). In general, the difficulty
with [...] computations is that they must be synchronized [...] inefficiency
because some processors must be idle while waiting for the others to finish.
With the checkerboard method, only two [...] step. The inefficiency [...]
the number of processors is much smaller than MN/2.

[...] randomness into the computation. Such methods, called chaotic [...],
have been tested by Rosenfeld [3]. However, the faster rate of convergence
[...] required synchronization.* Thus, the checkerboard method seems to be
the best possible one for a computer with asynchronous processors - at least
when the number of processors is ≤ MN/2.

* For efficient coding, the computer probably needs M · ⌈(N + 1)/2⌉
processors, where ⌈x⌉ denotes the smallest integer ≥ x.


* It is not known whether there is any ω > 1 for which the chaotic
over-relaxation method always converges. Tests described in [3] indicate that
over-relaxation does not improve the chaotic method nearly as much as it does
the checkerboard method when the number of processors is ≫ 1.


I.  THE GIVEN LOOP


We will consider DO loops of the following form:

(1)        DO α I^1 = ℓ^1, u^1
              . . .
           DO α I^n = ℓ^n, u^n
              loop body
    α      CONTINUE

where the ℓ^i and u^i are positive integers,* and the loop body has no I/O
statements, no subroutine or function calls which can modify data, and
no transfer of control to any statement outside the loop. The extension to
more general loops will be discussed later.

Let Z denote the set of all integers, and let Z^n denote the set of
n-tuples of integers. For completeness, define Z^0 = {0}.

The index set ℐ of the loop (1) is defined to be the subset
{ (i^1, ..., i^n) : ℓ^j ≤ i^j ≤ u^j } of Z^n. Thus, for the loop

           DO 7 I^1 = 1, 10
           DO 7 I^2 = 1, 20

ℐ is

           { (x, y) : 1 ≤ x ≤ 10, 1 ≤ y ≤ 20 } .

An execution of the loop body for an element (p^1, ..., p^n) of ℐ
is the process of setting I^1 = p^1, ..., I^n = p^n and then executing the
loop body in the usual fashion, stopping when statement α is reached.
Executing the entire loop (1) then involves the execution of the loop body
for each element of ℐ, in the order specified by the DO statements.

* The use of superscripts and subscripts is in accord with the usual
notation of tensor algebra.

This suggests that we order the elements of Z^n lexicographically
in the usual manner, with (2, 9, 13) < (3, -1, 10) < (3, 0, 0). Then for
any elements P and Q of ℐ, the loop body is executed for P before it is
executed for Q if and only if P < Q. Thus, the relation < on Z^n gives the
appropriate temporal ordering of ℐ. In the preceding example, the loop
body is executed for (2, 11) before it is executed for (3, 5), since
(2, 11) < (3, 5).
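The correspondence between lexicographic order and the order of execution of nested loops can be seen in a few lines of Python, whose built-in tuple comparison happens to be lexicographic. The loop bounds below are hypothetical and chosen only to keep the example small.

```python
# A hypothetical two-level loop:  DO I1 = 1, 3 / DO I2 = 1, 4
execution_order = []
for i1 in range(1, 4):
    for i2 in range(1, 5):
        execution_order.append((i1, i2))   # record each element of the index set


def precedes(p, q):
    """True if p comes before q lexicographically (Python compares tuples this way)."""
    return p < q


# The nested loops visit the index set in exactly lexicographic order.
visited_in_lexicographic_order = (execution_order == sorted(execution_order))
print(visited_in_lexicographic_order)
print(precedes((2, 9, 13), (3, -1, 10)), precedes((3, -1, 10), (3, 0, 0)))
```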

Define addition and subtraction of elements of Z^n by coordinate-wise
addition and subtraction, as usual. Thus, (3, -1, 0) + (2, 2, 4) =
(5, 1, 4). Let 0 denote the element (0, 0, ..., 0). It is easy to see that
for any P, Q ∈ Z^n, we have P < Q if and only if Q - P > 0.


II.  THE DO CONC STATEMENT

Our objective is to find a new temporal ordering of the executions of
the loop body so that at any given time, the loop body is being executed in
parallel for different elements of the index set by different processors. This
new ordering must yield an algorithm which is equivalent to the one
described by the original loop; i.e., one which computes the same values for
all variables as the original loop.

Consider the loop

(2)        DO 10 I^1 = 1, 3
           DO 10 I^2 = 2, 7
           A(I^1 + 3, I^2) = 0
    10     CONTINUE .

The loop body could be executed in parallel by three processors for the
points (1, 6), (2, 5), and (3, 4) of ℐ. (In fact, it could be executed in
parallel by 18 processors for all points in ℐ.)

In order to have a means of expressing parallel computation, we
define the DO CONC (for CONCurrently) statement. Its form is

           DO α CONC FOR ALL I ∈ 𝒮

where 𝒮 is a finite subset of Z.* It has the following meaning: Let
𝒮 = { i_1, ..., i_m }, where no two i_j are equal, and assume that we have
m independent, completely asynchronous processors numbered 1 through m.
Then each processor is to execute the statements following the DO CONC
statement, through statement α, with processor number j setting I = i_j.
The m processors are to run concurrently, independent of one another.

* We remind the reader that a set is just an unordered collection of elements.
E.g., {1, 2} = {2, 1} = {1, 2, 1, 1, 2}. We will not bother to define a syntax
for expressing sets. The usual FORTRAN DO syntax, which can only
describe a restricted class of subsets of Z, is probably the most
convenient to implement.

As an example, consider

           DO 10 CONC FOR ALL J ∈ { x : 2 ≤ x ≤ 5 }
    10     A(J) = J ** 2

This sets A(2) = 4, A(3) = 9, A(4) = 16 and A(5) = 25.

For a DO CONC to give a well-defined algorithm, certain restrictions
must be made on the statements in its range. Suppose the statement

    9      B(J) = A(J + 1)

is inserted before statement 10 above. The resulting DO CONC loop does
not give well-defined results. For example, the processor doing the
computation for J = 3 sets B(3) to the value of A(4). But the value of A(4) it
uses depends upon whether or not the processor for J = 4 has already executed
statement 10. Since the processors are assumed to be asynchronous, the
resulting value of B(3) is not well-defined.
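The ill-defined result can be made concrete by simulating two possible interleavings of the processors for J = 3 and J = 4. The Python sketch below is only an illustration: the initial contents of A (all zeros), the `run` helper, and the representation of a schedule as a list of (J, statement number) pairs are assumptions of the example.

```python
def run(schedule):
    """Execute statements 9 and 10 in the interleaving given by `schedule`.

    schedule: list of (J, statement_number) pairs, in the order they happen."""
    A = {j: 0 for j in range(2, 7)}   # assumed initial contents of A
    B = {}
    for j, statement in schedule:
        if statement == 9:
            B[j] = A[j + 1]           # 9   B(J) = A(J + 1)
        else:
            A[j] = j ** 2             # 10  A(J) = J ** 2
    return B


# The processor for J = 3 finishes before the processor for J = 4 starts ...
b_first = run([(3, 9), (3, 10), (4, 9), (4, 10)])
# ... or the other way round.
b_second = run([(4, 9), (4, 10), (3, 9), (3, 10)])

print(b_first[3], b_second[3])   # the two schedules leave different values in B(3)
```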

We  will  not  bother  specifying  the  necessary  restrictions  on  the 
DO  CONC  loop.  The  DO  CONCs  which  will  be  written  appear  in  loops  which 
are  equivalent  to  the  original  DO  loop  (1),  and  hence  must  give  well- 
defined  algorithms. 

The DO CONC statement is generalized to the form

           DO α CONC FOR ALL (I^1, ..., I^k) ∈ 𝒮 ,

where 𝒮 is a subset of Z^k. The meaning should be clear: for each
element (p^1, ..., p^k) ∈ 𝒮, we have a processor performing the calculation
for I^1 = p^1, ..., I^k = p^k.


III.  REWRITING THE LOOP

Consider loop (2), with index set ℐ. Changing the order of
execution of the loop body for the different elements of ℐ obviously
does not change the algorithm. The loop can, therefore, be rewritten as
a single DO CONC, or in many different ways as a nested DO/DO CONC
loop. Choosing one of these ways, we rewrite it as follows:

(3)        DO 10 J^1 = 3, 10
           DO 10 CONC FOR ALL J^2 ∈ { y : 2 ≤ y ≤ 7 and J^1 - 3 ≤ y ≤ J^1 - 1 }
           A(J^1 - J^2 + 3, J^2) = 0
    10     CONTINUE .

The choice is arbitrary and unnatural, but instructive.

To actually construct loop (3), we first defined the one-to-one
mapping J: Z^2 → Z^2 by

           J[(I^1, I^2)] = (I^1 + I^2, I^2) = (J^1, J^2) ,

as illustrated in Figure 1. We next defined the index set 𝒥 to be the
set J(ℐ) = { J(P) : P ∈ ℐ }. Then
𝒥 = { (j^1, j^2) : 3 ≤ j^1 ≤ 10, 2 ≤ j^2 ≤ 7 and j^1 - 3 ≤ j^2 ≤ j^1 - 1 },
and we filled in the limits of the DO and DO CONC statements to give this
index set. Finally, we rewrote the loop body in such a way that executing the
body of loop (3) for the point J(P) ∈ 𝒥 is equivalent to executing the body
of loop (2) for the point P ∈ ℐ. In other words, A(J^1 - J^2 + 3, J^2)
references the same array element as A(I^1 + 3, I^2) when
(J^1, J^2) = J[(I^1, I^2)].

We can consider loop (3) to be the same as loop (2), except with
a different order of execution of the body for the elements of ℐ. This order
of execution is illustrated in Figure 2. The loop body is executed
concurrently for all points in ℐ lying on a straight line J^1 = constant. The
execution for those points of ℐ with J^1 = 3 precedes the execution for the
points with J^1 = 4, which in turn precedes the execution for the points
with J^1 = 5, etc.

This suggests that we define the mapping π: Z^2 → Z by

           π[(i^1, i^2)] = j^1 = i^1 + i^2 .

Then the execution for P ∈ ℐ precedes the execution for Q ∈ ℐ if and
only if π(P) < π(Q). If π(P) = π(Q), then the two executions of the loop
body are concurrent.
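The construction of loop (3) from loop (2) can be checked mechanically. The following Python sketch enumerates the index set of loop (2), applies the mapping J, and groups the points by the value of the first coordinate of J(P); the variable names (`index_set`, `fronts`, `loop3`) are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Index set of loop (2):  1 <= I1 <= 3,  2 <= I2 <= 7
index_set = [(i1, i2) for i1 in range(1, 4) for i2 in range(2, 8)]


def J(p):
    """The mapping J[(I1, I2)] = (I1 + I2, I2)."""
    i1, i2 = p
    return (i1 + i2, i2)


def pi(p):
    """First coordinate of J(p): the 'time' at which p is executed."""
    return J(p)[0]


# Points with the same value of pi are executed concurrently.
fronts = defaultdict(list)
for p in index_set:
    fronts[pi(p)].append(p)

# Index set described by the DO and DO CONC limits of loop (3).
loop3 = [(j1, j2) for j1 in range(3, 11) for j2 in range(2, 8)
         if j1 - 3 <= j2 <= j1 - 1]

# The body of loop (3) must touch the same array element as the body of loop (2).
same_element = all((J(p)[0] - J(p)[1] + 3, J(p)[1]) == (p[0] + 3, p[1])
                   for p in index_set)

for time in sorted(fronts):
    print(time, fronts[time])
```

The front for the value 7 contains the three points (1, 6), (2, 5) and (3, 4) mentioned earlier as executable in parallel.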

The generalization of this rewriting procedure is straightforward.
Loop (1) will be rewritten in the form

(4)        DO α J^1 = λ^1, μ^1
              . . .
           DO α J^k = λ^k, μ^k
           DO α CONC FOR ALL (J^{k+1}, ..., J^n) ∈ 𝒮_{J^1, ..., J^k}
              loop body
    α      CONTINUE

where 𝒮_{J^1, ..., J^k} is a subset of Z^{n-k} which may depend upon the
values of J^1, ..., J^k. Here, λ^i and μ^i need not be integers, but may be
integer-valued expressions whose values depend upon J^1, ..., J^{i-1}.

To perform this rewriting, we will construct a one-to-one mapping
J: Z^n → Z^n of the form

(5)  J[(I^1, ..., I^n)] = (Σ_{j=1}^{n} a^1_j I^j, ..., Σ_{j=1}^{n} a^n_j I^j)
                        = (J^1, ..., J^n)

for integers a^i_j.* We then choose the λ^i, μ^i and 𝒮_{J^1, ..., J^k} so
that the index set 𝒥 of the loop (4) equals J(ℐ), and write the body of
loop (4) so that its execution for the point J(P) ∈ 𝒥 is equivalent to the
execution of the body of loop (1) for P ∈ ℐ.


Define the mapping π: Z^n → Z^k by

           π[(I^1, ..., I^n)] = (J^1, ..., J^k) ,

so π(P) consists of the first k coordinates of J(P). It is then clear that
for any points J(P), J(Q) ∈ 𝒥, the execution of the body of loop (4) for
J(P) precedes the execution for J(Q) if and only if π(P) < π(Q). If
we consider loop (4) to be a reordering of the execution of loop (1), this
statement is equivalent to the following:

(E)  For any P, Q ∈ ℐ, the execution of the loop
     body for P precedes that for Q, in the new
     ordering of executions, if and only if
     π(P) < π(Q).

* J is one-to-one if and only if (5) can be solved to write the I^i as linear
expressions in the J^j with integer coefficients.


The loop body is executed concurrently for all elements of ℐ
lying on a set of the form { P : π(P) = constant ∈ Z^k }. Since J is
assumed to be a one-to-one linear mapping, these sets are parallel
(n-k)-dimensional planes in Z^n.* We thus have concurrent execution
of the loop body along (n-k)-dimensional planes through the index set.

Naturally, we cannot use any arbitrary mapping J. We must find
one for which loop (4) gives an algorithm equivalent to that of loop (1).
This is the goal of the following analysis.

Observe that rewriting loop (1) so all executions are concurrent;
i.e., with a

           DO α CONC FOR ALL (I^1, ..., I^n) ∈ ℐ

statement, involves setting J equal to the identity mapping, k = 0,
and π: Z^n → Z^0 the mapping defined by π(P) = 0 for all P ∈ Z^n.


* We consider Z^n to be a subset of ordinary Euclidean n-space, as
we did in drawing Figures 2 and 3.


IV.  THE BASIC RULE


We first introduce some terminology to aid the discussion.
Consider the variable VAR defined by the statement

           DIMENSION VAR (10, 20) .

The range of VAR is the set ℛ_VAR = { (x, y) : 1 ≤ x ≤ 10, 1 ≤ y ≤ 20 },
which is a subset of Z^2. Thus, ℛ_VAR is the set of all (x, y) ∈ Z^2
such that VAR (x, y) is defined.*

An occurrence of VAR is any appearance of it in the loop body.
If the occurrence appears as

           VAR (-,-) = ...,

then it is called a generation; otherwise it is called a use. I.e.,
generations modify the values of elements of the array VAR, and uses
do not.


Let f denote an occurrence of VAR in loop (1) of the form
VAR (I^1 + 3, I^3), and assume n = 3. During execution of the loop body
for the element (3, 4, 5) ∈ ℐ, this occurrence becomes VAR (6, 5).
We say that f references the point (6, 5) ∈ ℛ_VAR for (3, 4, 5).
This defines an occurrence mapping T_f: ℐ → ℛ_VAR, letting T_f(P) be the
point of ℛ_VAR referenced by f for P ∈ ℐ. In this case, T_f is given by

           T_f[(p^1, p^2, p^3)] = (p^1 + 3, p^3) .

* For a scalar variable x, we set ℛ_x = Z^0.

We will assume that all variable occurrences only have the
loop variables I^1, ..., I^n and integer constants in their subscript
expressions. Then for any variable occurrence g, the occurrence
mapping T_g: Z^n → Z^m is well-defined, where m is the dimension
(number of subscript positions) of the variable.
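An occurrence mapping is simple to model. In the Python sketch below, an occurrence is described by a list of (loop variable number, constant offset) pairs, one per subscript position; this encoding and the name `occurrence_mapping` are assumptions made for the example.

```python
def occurrence_mapping(subscripts):
    """Build the occurrence mapping T for an occurrence.

    subscripts: one (loop_variable_number, offset) pair per subscript position,
    e.g. VAR(I1 + 3, I3) is written [(1, 3), (3, 0)]."""
    def T(p):
        # p is an element (p1, ..., pn) of the index set; variables are 1-based.
        return tuple(p[var - 1] + offset for var, offset in subscripts)
    return T


# The occurrence f = VAR(I1 + 3, I3) from the text, with n = 3.
T_f = occurrence_mapping([(1, 3), (3, 0)])
referenced = T_f((3, 4, 5))
print(referenced)
```

For the element (3, 4, 5) the sketch yields the point (6, 5), matching the reference VAR (6, 5) described above.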

Now consider the loop

(6)        DO 23 I^1 = 2, 10
           DO 23 I^2 = 3, 17
    21     A (I^1, I^2) = C (I^1)
             (a1)          (c1)
    22     B (I^1, I^2) = A (I^1 - 1, I^2 + 1) + B (I^1, I^2)
             (b1)          (a2)                   (b2)
    23     CONTINUE .

We have introduced the convention of writing the name of an occurrence
in a circle beneath it (rendered here in parentheses). For the point
(4, 7) ∈ ℐ, the loop body is

    21     A (4, 7) = C (4)
    22     B (4, 7) = A (3, 8) + B (4, 7) .

The value A (3, 8) used in statement 22 is the one computed in statement
21 during execution of the loop body for the point (3, 8). To ensure that
the execution for (4, 7) computes the right value when we change the order
of executions of the body, we need only require that it be preceded by the
execution for (3, 8). By statement E above, this means that π must
satisfy π[(3, 8)] < π[(4, 7)].

In general, let VAR be any variable. If a generation and a use of
VAR both reference the same element in the range of VAR during execution
of the loop, then the order of the references must be preserved. In other
words, if f is a generation and g is a use of VAR, and T_f(P) = T_g(Q)
for some points P, Q ∈ ℐ, then:

(I)   If P < Q, we must have π(P) < π(Q),

(II)  If Q < P, we must have π(Q) < π(P).

In the above example, T_a1[(3, 8)] = T_a2[(4, 7)] = (3, 8) and
(3, 8) < (4, 7), so we must have π[(3, 8)] < π[(4, 7)]. Note that if
P = Q, then the order of execution of the references will automatically be
preserved, since they happen during a single execution of the loop body.
Thus, the fact that T_b1[(4, 7)] = T_b2[(4, 7)] does not place any
restriction on our choice of π.

The above rule should also apply to any two generations of a
variable. This guarantees that the variable has the correct values after
the loop is run. Together with the above rule, it also ensures that a use
will always obtain the value assigned by the correct generation.

These remarks can be combined into the following basic rule:

(C1)  For every variable, and every ordered pair of occurrences
f, g of that variable, at least one of which is a generation: if
T_f(P) = T_g(Q) for P, Q ∈ ℐ with P < Q, then π must satisfy the relation
π(P) < π(Q).

Notice that the case Q < P is obtained by interchanging f and g.

Rule C1 ensures that the new ordering of executions of the loop
body preserves all relevant orderings of variable references. The
orderings not necessarily preserved are those between references to
different array elements, and between two uses. Changing just these
orderings cannot change the value of anything computed by the loop.
The assumptions we have made about the loop body, especially the
assumption that it contains no exits from the loop, therefore imply that
rule C1 gives a sufficient condition for loop (4) to be equivalent to loop
(1).*


* For most loops, C1 is also a necessary condition.


The trouble with rule C1 is that it requires that we consider many
pairs of points P, Q in ℐ. For the loop (6), there are 112 pairs of
elements P, Q ∈ ℐ with T_a1(P) = T_a2(Q) and P < Q. However,
T_a1(P) = T_a2(Q) if and only if Q = P + (1, -1). We would like to be able
to work with the single point (1, -1) ∈ Z^n, rather than all the pairs P, Q.

This suggests the following definition. For any occurrences f,
g of a variable in loop (1), define the subset <f, g> of Z^n by

           <f, g> = { X : T_f(P) = T_g(P + X) for some P ∈ Z^n } .

Thus, for loop (6) we have <a1, a2> = { (1, -1) }, and
<b1, b2> = { (0, 0) }. Observe that <f, g> is independent of the index
set ℐ.
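The count of 112 pairs and the single difference (1, -1) for loop (6) can be confirmed by brute force. The Python sketch below enumerates the index set of loop (6) and the occurrence mappings of a1 and a2; the final check uses one hypothetical choice of mapping, taking the first coordinate of each point, purely to show a mapping that respects every pair.

```python
# Index set of loop (6):  2 <= I1 <= 10,  3 <= I2 <= 17
index_set = [(i1, i2) for i1 in range(2, 11) for i2 in range(3, 18)]


def T_a1(p):
    """Occurrence a1:  A(I1, I2)  (the generation in statement 21)."""
    return (p[0], p[1])


def T_a2(p):
    """Occurrence a2:  A(I1 - 1, I2 + 1)  (the use in statement 22)."""
    return (p[0] - 1, p[1] + 1)


# All pairs P < Q for which the generation at P and the use at Q
# reference the same element of A.
pairs = [(P, Q) for P in index_set for Q in index_set
         if P < Q and T_a1(P) == T_a2(Q)]

# Every such pair differs by the same vector X = Q - P.
differences = {(Q[0] - P[0], Q[1] - P[1]) for P, Q in pairs}


def pi(p):
    """A hypothetical choice of mapping: the first coordinate only."""
    return p[0]


ordering_preserved = all(pi(P) < pi(Q) for P, Q in pairs)
print(len(pairs), differences, ordering_preserved)
```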

We now rewrite rule C1 in terms of the sets <f, g>. First,
note that π(P + X) = π(P) + π(X), since we have assumed π to be a
linear mapping. (Recall the definition of π, and formula (5).) Also,
remember that A < A + B if and only if B > 0. Then just substituting
P + X for Q in rule C1 yields:

(C1')  For ... generation: if T_f(P) = T_g(P + X) for
       P, P + X ∈ ℐ with X > 0, then π must satisfy
       the relation π(X) > 0.

Removing the clause "for P, P + X ∈ ℐ" from C1' gives a stronger
rule, stated in terms of the sets <f, g>:

(C2)   For ... generation: for every X ∈ <f, g> with
       X > 0, π must satisfy the relation π(X) > 0.

Any π satisfying C2 also satisfies C1. Hence, rule C2 gives
a sufficient condition for loop (4) to be equivalent to loop (1). Moreover,
C2 is independent of the index set ℐ.

Note here that C2 is satisfied by the zero mapping π: Z^n → Z^0
if and only if it is vacuous; i.e., if and only if there are no elements
X > 0 in any of the sets <f, g> referred to in the rule. In this case,
the loop body can be executed concurrently for all points in ℐ.


VI.  COMPUTING THE SETS <f,g>

We will obtain results about the existence of mappings $\pi$
satisfying rule C2. In order to do this, some restrictions must be
made on the forms of the occurrences to permit a simple computation
of the sets $\langle f, g \rangle$. We assume that each occurrence of a variable
VAR is of the form

    (7)   $VAR(I^{j_1} + m^1, \ldots, I^{j_r} + m^r)$,

where the $m^k$ are integer constants, and $j_1, \ldots, j_r$ are $r$ distinct
integers between 1 and $n$. Moreover, we assume that the $j_k$ are the
same for any two occurrences of VAR. Thus, if an occurrence
$A(I^2 - 1, I^1, I^4 + 1)$ appears in the loop, then the occurrence
$A(I^2 + 1, I^1 + 6, I^4)$ may appear. However, the occurrence
$A(I^1 - 1, I^2, I^4)$ may not.

Now let $f$ be the occurrence (7) and let $g$ be the occurrence
$VAR(I^{j_1} + n^1, \ldots, I^{j_r} + n^r)$. Then

    $T_f[(p^1, \ldots, p^n)] = (p^{j_1} + m^1, \ldots, p^{j_r} + m^r)$

    $T_g[(p^1, \ldots, p^n)] = (p^{j_1} + n^1, \ldots, p^{j_r} + n^r)$.

It is easy to see from the definition that $\langle f, g \rangle$ is the set of all
elements of $Z^n$ whose $j_k$-th coordinate is $m^k - n^k$, for $k = 1, \ldots, r$,
and whose remaining $n - r$ coordinates are any integers.

As an example, suppose $n = 5$ and $f$, $g$ are the occurrences
$VAR(I^3 + 1, I^2 + 5, I^5)$, $VAR(I^3 + 1, I^2, I^5 + 1)$. Then
$\langle f, g \rangle = \{ (x, 5, 0, y, -1) : x, y \in Z \}$. We will denote this set by
$(*, 5, 0, *, -1)$, so * means "any integer".

The index variable $I^j$ is said to be missing from VAR if $I^j$ is
not one of the $I^{j_k}$ in (7). It is clear that $I^j$ is missing from VAR if and
only if the set $\langle f, g \rangle$ has an * in the $j$-th coordinate, for any occurrences
$f$, $g$ of VAR.
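
The rule for computing a set <f,g> from two occurrences of the form (7) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `occurrence_set` and its parameter names are hypothetical, and the string "*" stands for the text's "any integer" coordinate.

```python
def occurrence_set(positions, m, n_offsets, n):
    """Return <f,g> as an n-tuple, with "*" for coordinates that may be any integer.

    positions : the indices j_1, ..., j_r (1-based) shared by both occurrences
    m         : the integer constants m^1, ..., m^r of occurrence f
    n_offsets : the integer constants n^1, ..., n^r of occurrence g
    n         : the number of index variables
    """
    result = ["*"] * n                  # missing index variables stay "*"
    for j, mk, nk in zip(positions, m, n_offsets):
        result[j - 1] = mk - nk         # j_k-th coordinate is m^k - n^k
    return tuple(result)


# The example from the text: n = 5,
# f = VAR(I3 + 1, I2 + 5, I5) and g = VAR(I3 + 1, I2, I5 + 1).
example = occurrence_set(positions=[3, 2, 5], m=[1, 5, 0],
                         n_offsets=[1, 0, 1], n=5)
print(example)
```

For the occurrences of the example this reproduces the set written (*, 5, 0, *, -1) in the text.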


VII.  THE HYPERPLANE THEOREM

$I^j$ is called a missing index variable if it is missing from
some generated variable in the loop; i.e., if it is missing from some
variable for which a generation appears in the loop body.

The following result is an important special case of a more
general result which will be given later.*

Hyperplane Concurrency Theorem: Assume that none of the index
variables $I^2, \ldots, I^n$ is a missing variable. Then loop (1) can be
rewritten in the form of loop (4) for $k = 1$. Moreover, the mapping
$J$ used for the rewriting can be chosen to be independent of the index
set $\mathcal{I}$.

Proof: First, a mapping $\pi: Z^n \to Z$ will be constructed which satisfies
rule C2. Let $\mathcal{P}$ be the set consisting of all the elements $X > 0$ of all
the sets $\langle f, g \rangle$ referred to in C2. We must construct $\pi$ so that
$\pi(X) > 0$ for all $X \in \mathcal{P}$.

Let "+" denote any positive integer, so $(+, x^2, \ldots, x^n)$ is any
element of $Z^n$ of the form $(x, x^2, \ldots, x^n)$ with $x > 0$. Since $I^1$
is the only index variable which may be missing, we can write
$\mathcal{P} = \{ X_1, \ldots, X_N \}$, where

    $X_r = (x_r^1, \ldots, x_r^n)$   or   $X_r = (+, x_r^2, \ldots, x_r^n)$

for some integers $x_r^j$.


*A weaker version of this result can be found in [2].


The mapping $\pi$ is defined by

    (8)   $\pi[(I^1, \ldots, I^n)] = a_1 I^1 + \ldots + a_n I^n$

for non-negative integers $a_j$ to be chosen below. Since $a_1 \geq 0$,
$\pi[(1, x_r^2, \ldots, x_r^n)] > 0$ implies $\pi[(x, x_r^2, \ldots, x_r^n)] > 0$ for
any integer $x > 0$. Therefore, each $X_r$ of the form $(+, x_r^2, \ldots, x_r^n)$
can be replaced by $X_r = (1, x_r^2, \ldots, x_r^n)$, and it is sufficient to
construct $\pi$ such that $\pi(X_r) > 0$ for each $r = 1, \ldots, N$. Recall
that each $X_r > 0$.

Define $\mathcal{P}_j = \{ X_r : x_r^1 = \ldots = x_r^{j-1} = 0, \; x_r^j \neq 0 \}$, so $\mathcal{P}_j$
is the set of all $X_r$ whose $j$-th coordinate is the left-most non-zero
one. Then each $X_r$ is an element of some $\mathcal{P}_j$.

Now construct the $a_j$ sequentially for $j = n, n - 1, \ldots, 1$
as follows. Let $a_j$ be the smallest non-negative integer such that

    $a_j x_r^j + \ldots + a_n x_r^n > 0$

for each $X_r = (0, \ldots, 0, x_r^j, \ldots, x_r^n) \in \mathcal{P}_j$. Since $X_r > 0$ and
$x_r^j \neq 0$ implies $x_r^j > 0$, this is possible.

Clearly, we have $\pi(X_r) > 0$ for all $X_r \in \mathcal{P}_j$. But each
$X_r$ is in some $\mathcal{P}_j$, so $\pi(X_r) > 0$ for each $r = 1, \ldots, N$. Thus, $\pi$
satisfies rule C2. Observe that the first non-zero $a_j$ that was chosen
must equal 1, so 1 is the greatest common divisor of the $a_j$. (If all
the $a_j$ are zero, then $\mathcal{P}$ must be empty, so we can let
$\pi[(I^1, \ldots, I^n)] = I^1$.) A classical number theoretic calculation,
described on page 31 of [3], and reproduced in Appendix A, then gives a
one-to-one linear mapping $J: Z^n \to Z^n$ such that

    $J[(I^1, \ldots, I^n)] = (\pi[(I^1, \ldots, I^n)], \ldots)$.

Since the sets $\langle f, g \rangle$ are independent of the index set $\mathcal{I}$,
the construction of $\pi$ and $J$ given above is also independent of $\mathcal{I}$.

This completes the proof.  □
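
The sequential choice of the coefficients in the proof can be sketched as follows. This is a minimal illustration, not the report's implementation: the name `hyperplane_coefficients` is hypothetical, each constraint vector is assumed to have had "+" already replaced by 1, and the input list in the usage line is derived here from the rule of Section VI for a three-index relaxation loop rather than copied from a table.

```python
def hyperplane_coefficients(vectors, n):
    """Choose non-negative integers a_1..a_n with a . X > 0 for every X in vectors.

    vectors : lexicographically positive integer n-tuples (the elements X_r)
    Returns the list [a_1, ..., a_n], built for j = n, n-1, ..., 1.
    """
    a = [0] * n
    for j in range(n - 1, -1, -1):
        # P_j: vectors whose left-most non-zero coordinate is coordinate j.
        group = [x for x in vectors
                 if all(c == 0 for c in x[:j]) and x[j] != 0]
        if any(x[j] < 0 for x in group):
            raise ValueError("every vector must be lexicographically positive")

        def satisfied(aj):
            return all(aj * x[j] + sum(a[i] * x[i] for i in range(j + 1, n)) > 0
                       for x in group)

        aj = 0
        while not satisfied(aj):    # terminates because x[j] > 0 in the group
            aj += 1
        a[j] = aj
    return a


# Assumed constraint vectors for a loop whose first index variable is missing.
constraints = [(1, 0, 0), (1, -1, 0), (1, 1, 0), (0, 1, 0),
               (1, 0, -1), (1, 0, 1), (0, 0, 1)]
print(hyperplane_coefficients(constraints, 3))
```

If the list of vectors is empty, the function returns all zeros; as the proof notes, one may then simply take the mapping that selects the first index variable.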

Loop (4) for $k = 1$ executes the loop body concurrently for all
points in $\mathcal{I}$ lying along parallel $(n - 1)$-dimensional hyperplanes,
hence the name of the theorem.

Observe that the theorem is trivially true without the restriction
that $J$ be independent of $\mathcal{I}$, because given any set $\mathcal{I}$ we can construct
a $J$ for which the sets $\mathcal{I}_{J^1}$ contain at most one element,
and the order of execution of the loop body is unchanged. For example,
if $\mathcal{I} = \{ (x, y, z) : 1 \leq x \leq 10, \; 1 \leq y \leq 5, \; 1 \leq z \leq 7 \}$, let
$J[(x, y, z)] = (35x + 7y + z, \; x, \; y)$. Such a $J$ is clearly of no interest.
However, because the mapping $J$ provided by the theorem depends only on
the loop body, it will always give real concurrent execution for a
large enough index set.

Condition C2 gives a set of constraints on the mapping
$\pi: Z^n \to Z$. The Hyperplane Theorem proves the existence of a $\pi$
satisfying those constraints. We now consider the problem of making
an optimal choice of $\pi$.

It seems most reasonable to minimize the number of steps in
the outer DO $J^1$ loop of loop (4). (Remember that $k = 1$.) If a
sufficiently large number of processors are available, then this gives
the maximum amount of concurrent computation. This means that we
must minimize $\mu^1 - \lambda^1$ in loop (4). But $\lambda^1$ and $\mu^1$ are just the
lower and upper bounds of $\{ \pi(P) : P \in \mathcal{I} \}$. Setting

    $M^i = u^i - l^i$,

it is easy to see that $\mu^1 - \lambda^1$ equals

    (9)   $M^1 |a_1| + \ldots + M^n |a_n|$,

where the $a_i$ are defined by (8). Finding an optimal $\pi$ is thus
reduced to the following integer programming problem: find integers
$a_1, \ldots, a_n$ satisfying the constraint inequalities given by rule C2,
which minimize the expression (9).

Observe that the greatest common divisor of the resulting $a_i$
must be 1. This follows because the constraints are of the form

    $x^1 a_1 + \ldots + x^n a_n > 0$,

so dividing the $a_i$ by their g.c.d. gives new values of $a_i$ satisfying
the constraints, with a smaller value for (9). Hence, the method of
[3] can be applied to find the mapping $J$.

Although the above integer programming problem is algorithmically
solvable, we know of no practical method of finding a solution in the
general case. However, the construction used in proving the Hyperplane
Theorem should provide a good choice of $\pi$. In fact, for most
reasonable loops it actually gives the optimal solution.


VIII.  AN  EXAMPLE 


Now consider the following loop:

    (10)     DO 16 I1 = 1, 25
             DO 16 I2 = 2, 19
             DO 16 I3 = 2, 29
             F(I2, I3) = (F(I2 + 1, I3) + F(I2, I3 + 1)
               f1            f2              f3
         X       + F(I2 - 1, I3) + F(I2, I3 - 1)) * .25
                     f4              f5
       16    CONTINUE

It is a simplified version of a standard relaxation computation for a
20 by 30 array F, performed 25 times.

To apply the method of analysis, first perform the following
calculations:

    1.  Compute the sets $\langle f, g \rangle$ referred to by rule C2.

    2.  Find all elements $X > 0$ in these sets.

    3.  Find the constraints on the $a_j$ implied by $\pi(X) > 0$.

This is done in Table 1.

[Table 1: the sets <f,g>, their elements X > 0, and the resulting
constraints on the a_j]

Next, choose $a_1$, $a_2$, $a_3$ consistent with these constraints
and minimizing

    $24 |a_1| + 17 |a_2| + 27 |a_3|$.

It is easy to see that the solution to this problem is $a_1 = 2$, $a_2 = 1$,
$a_3 = 1$, so $\pi$ is given by

    $\pi[(I^1, I^2, I^3)] = 2 I^1 + I^2 + I^3$.

Note that this is the $\pi$ computed by the algorithm used in the proof of
the Hyperplane Theorem.
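
For a loop this small, the integer programming problem can be checked by exhaustive search. The sketch below is an assumption-laden illustration: `best_coefficients` and the search bound are hypothetical, and the constraint vectors are derived here by applying the rule of Section VI to the five occurrences of F in loop (10), with "+" replaced by 1, rather than copied from Table 1.

```python
from itertools import product


def best_coefficients(vectors, M, bound=4):
    """Search all integer a with |a_i| <= bound for the minimum of expression (9).

    vectors : constraint vectors X; each requires a . X > 0
    M       : the values M^i = u^i - l^i of the original loop
    Returns (cost, a) for the cheapest feasible a, or None if none is found.
    """
    best = None
    for a in product(range(-bound, bound + 1), repeat=len(M)):
        if all(sum(ai * xi for ai, xi in zip(a, x)) > 0 for x in vectors):
            cost = sum(Mi * abs(ai) for Mi, ai in zip(M, a))
            if best is None or cost < best[0]:
                best = (cost, a)
    return best


# Assumed constraints for loop (10): a1 > 0, a1 > a2, a1 > a3, a2 > 0, a3 > 0.
constraints = [(1, 0, 0), (1, -1, 0), (1, 1, 0), (0, 1, 0),
               (1, 0, -1), (1, 0, 1), (0, 0, 1)]
print(best_coefficients(constraints, M=(24, 17, 27)))
```

Exhaustive search is only practical for a handful of index variables; it is shown here to confirm the example, not as a general method.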

Application of the algorithm described in the appendix gives the
following mapping $J$:

    $J[(I^1, I^2, I^3)] = (J^1, J^2, J^3) = (2 I^1 + I^2 + I^3, \; I^1, \; I^3)$.*

Using this, and the inverse relation

    $(I^1, I^2, I^3) = (J^2, \; J^1 - 2 J^2 - J^3, \; J^3)$,

the above loop is rewritten as follows:

             DO 16 J1 = 6, 98
             DO 16 CONC FOR ALL (J2, J3) ∈ { (x, y):
         X       1 ≤ x ≤ 25, 2 ≤ y ≤ 29 and J1 - 19 ≤ 2x + y ≤ J1 - 2 }
             F(J1 - 2 * J2 - J3, J3) = (F(J1 - 2 * J2 - J3 + 1, J3) +
         X       F(J1 - 2 * J2 - J3, J3 + 1) + F(J1 - 2 * J2 - J3 - 1, J3)
         X       + F(J1 - 2 * J2 - J3, J3 - 1)) * .25
       16    CONTINUE

*It is also easy to obtain this from the following fact: the mapping
$J: Z^n \to Z^n$ defined by (5) is one-to-one if and only if the determinant
of the $a^i_j$ is $\pm 1$.

The set expression in the DO CONC statement was obtained by
writing the relations $1 \leq I^1 \leq 25$, $2 \leq I^2 \leq 19$, and $2 \leq I^3 \leq 29$ in
terms of $J^1$, $J^2$, $J^3$.

To understand why the rewritten loop gives the same results,
consider the computation of F(4, 6) in the execution of the original
loop body for the element $(9, 4, 6) \in \mathcal{I}$. It is set equal to the average
of its four neighboring array elements: F(5, 6), F(4, 7), F(3, 6), and
F(4, 5). The values of F(5, 6) and F(4, 7) were calculated during the
execution of the loop body for (8, 5, 6) and (8, 4, 7), respectively;
i.e., during the previous execution of the DO $I^1$ loop, with $I^1 = 8$.
The values of F(3, 6) and F(4, 5) were calculated during the current
execution of the outer DO loop, with $I^1 = 9$. This is shown in Figure 3.

Now consider the rewritten loop. At any time during its
execution, F(p, q) is being computed concurrently for up to half the
elements (p, q) in the range set of F. These computations are
for different values of $I^1$. Figure 4 illustrates the execution of the
DO CONC for $J^1 = 27$. The points (p, q) for which F(p, q) is
being computed are marked with "x"s, and the value of $I^1$ for the
computation is indicated. Figure 5 shows the same thing for $J^1 = 28$.
Note how the values being used in the computation of F(4, 6)
in Figure 5 were computed in Figure 4. A comparison with Figure 3
illustrates why this method of concurrent execution is equivalent to
the original algorithm.

The saving in execution time achieved by the rewriting will
depend upon the amount of overhead in the implementation of the DO
CONC, as well as the actual number of processors available. (The
sets in the DO CONC statement contain up to 252 elements.) However,
the value of this approach is indicated by the fact that the number of
sequential iterations has been reduced by a factor of over 135. (However,
we must point out that a real program would probably include a
convergence test within the outer DO $I^1$ loop, so the analysis could
only be applied to the inner DO $I^2$/DO $I^3$ loop.)
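
The equivalence of loop (10) and its rewritten form can be checked by direct simulation. The following sketch makes several assumptions: the function names are hypothetical, the DO CONC is modelled by visiting the points of each hyperplane in a shuffled order (any order should give the same result if the rewriting is correct), and F is stored with an unused row and column 0 so that subscripts match the FORTRAN text.

```python
import random


def original_loop(F):
    """Loop (10) as written: 25 sequential sweeps over the interior of F."""
    F = [row[:] for row in F]
    for i1 in range(1, 26):
        for i2 in range(2, 20):
            for i3 in range(2, 30):
                F[i2][i3] = (F[i2 + 1][i3] + F[i2][i3 + 1]
                             + F[i2 - 1][i3] + F[i2][i3 - 1]) * .25
    return F


def rewritten_loop(F, rng):
    """The DO J1 / DO CONC form, with an arbitrary order inside each DO CONC."""
    F = [row[:] for row in F]
    set_sizes = []
    for j1 in range(6, 99):
        points = [(j2, j3)
                  for j2 in range(1, 26) for j3 in range(2, 30)
                  if j1 - 19 <= 2 * j2 + j3 <= j1 - 2]
        rng.shuffle(points)             # stand-in for concurrent execution
        for j2, j3 in points:
            p = j1 - 2 * j2 - j3        # this is I2 expressed in J1, J2, J3
            F[p][j3] = (F[p + 1][j3] + F[p][j3 + 1]
                        + F[p - 1][j3] + F[p][j3 - 1]) * .25
        set_sizes.append(len(points))
    return F, set_sizes


rng = random.Random(0)
F0 = [[rng.random() for _ in range(31)] for _ in range(21)]   # 20 by 30 array
result, set_sizes = rewritten_loop(F0, rng)
print(result == original_loop(F0), len(set_sizes), max(set_sizes))
```

The recorded set sizes also give the number of outer iterations and the largest DO CONC set, which can be compared with the figures quoted in the text.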


IX.  THE  GENERAL  PLANE  THEOREM 


We now generalize the Hyperplane Theorem to cover the case
when some of the index variables $I^2, \ldots, I^n$ are missing. Concurrent
execution is then possible for the points in $\mathcal{I}$ lying along parallel
planes. Each missing variable may lower the dimension of the planes
by one.

Plane Concurrency Theorem: Assume that at most $k - 1$ of the index
variables $I^2, \ldots, I^n$ are missing. Then loop (1) can be rewritten in
the form of loop (4). Moreover, the mapping $J$ used for the rewriting
can be chosen to be independent of the index set $\mathcal{I}$.


Proof: The proof is a generalization of the proof of the Hyperplane
Theorem. Let $I^{j_2}, \ldots, I^{j_k}$ be the possibly missing variables among
$I^2, \ldots, I^n$. Set $j_1 = 1$ and $j_{k+1} = n + 1$, and assume
$j_1 < j_2 < \ldots < j_k < j_{k+1}$.

Let $\mathcal{P}$ be the set of all elements $X > 0$ of all the sets $\langle f, g \rangle$
referred to by rule C2. We must construct $\pi$ so that $\pi(X) > 0$ for all
$X \in \mathcal{P}$. Let $\mathcal{P}_j = \{ (0, \ldots, 0, x^j, \ldots, x^n) \in \mathcal{P} : x^j > 0 \}$, so $\mathcal{P}_j$
is the set of all elements of $\mathcal{P}$ whose $j$-th coordinate is the left-most
non-zero one. Then every element of $\mathcal{P}$ is in one of the $\mathcal{P}_j$.

The mapping $\pi: Z^n \to Z^k$ will be constructed with

    $\pi(P) = (\pi^1(P), \ldots, \pi^k(P))$,

where each $\pi^i: Z^n \to Z$ will be defined by

    $\pi^i[(I^1, \ldots, I^n)] = a^i_1 I^1 + \ldots + a^i_n I^n$

for non-negative integers $a^i_j$. Moreover, we will have $a^i_j = 0$ if $j < j_i$ or
$j \geq j_{i+1}$. This implies that if $X \in \mathcal{P}_j$ and $j \geq j_{i+1}$, then $\pi^i(X) = 0$. It
therefore suffices to construct $\pi^i$ so that for each $j$ with $j_i \leq j < j_{i+1}$,
and $X \in \mathcal{P}_j$: $\pi^i(X) > 0$; for we then have

    $\pi(X) = (0, \ldots, 0, \pi^i(X), \ldots, \pi^k(X)) > 0$.

Recall that for the sets $\langle f, g \rangle$, an * can appear only in the
$j_1, \ldots, j_k$ coordinates. Thus any element of any of the sets $\mathcal{P}_j$ with
$j_i \leq j < j_{i+1}$ can be represented in the form

    $(0, \ldots, 0, x_r^{j_i}, \ldots, x_r^{j_{i+1}-1}, \ldots)$,   or

    $(0, \ldots, 0, +, x_r^{j_i+1}, \ldots, x_r^{j_{i+1}-1}, \ldots)$

for a finite collection of integers $x_r^j$, $j_i \leq j < j_{i+1}$. By the same argument
used in the proof of the Hyperplane Theorem, we can replace "+" by $x_r^{j_i} = 1$,
and choose $a^i_j \geq 0$, $j_i \leq j < j_{i+1}$, such that

    $a^i_{j_i} x_r^{j_i} + \ldots + a^i_{j_{i+1}-1} x_r^{j_{i+1}-1} > 0$

for each $r$. Choosing $a^i_j = 0$ for $j < j_i$ and $j \geq j_{i+1}$ completes the construction
of the required $\pi^i$.

The construction given in Appendix A is then applied to give
invertible relations of the form

    $J^{j_i} = a^i_{j_i} I^{j_i} + \ldots + a^i_{j_{i+1}-1} I^{j_{i+1}-1}$

    $J^j = \sum_{l = j_i}^{j_{i+1}-1} b^j_l I^l$    for $j_i < j < j_{i+1}$.

Combining these and reordering the $J^j$ gives the required mapping $J$.  □

As in the hyperplane case, to get an optimal solution we want
to minimize the number of iterations of the outer DO loops. This means
minimizing $(\mu^1 - \lambda^1 + 1) \ldots (\mu^k - \lambda^k + 1)$. Using the notation of (5), it is
easy to verify that this number is equal to

    (11)   $(M^1 |a^1_1| + \ldots + M^n |a^1_n| + 1) \ldots (M^1 |a^k_1| + \ldots + M^n |a^k_n| + 1)$,

where $M^i = u^i - l^i$.


Finding the $a^i_j$ is now an integer programming problem. Note
that a solution with $a^i_1 = \ldots = a^i_n = 0$ for some $i$ gives a solution to the
rewriting problem with $k$ replaced by $k - 1$, since that $\pi^i$ can be removed
without affecting the constraint inequalities. The Plane Concurrency Theorem
proves the existence of a $\pi: Z^n \to Z^k$ satisfying C2, for a particular value
of $k$. It may be possible to find such a $\pi$ for a smaller $k$.

For completeness, we will state a sufficient condition for the
loop body to be concurrently executable for all points in $\mathcal{I}$. This is the case
when the zero mapping ($\pi(P) = 0$ for all $P \in Z^n$) satisfies C2. Since
$\langle g, f \rangle = \{ -X : X \in \langle f, g \rangle \}$, it is clear that this is true if and only if all the sets
$\langle f, g \rangle$ are equal to $\{0\}$. Finally, the rules for computing the sets $\langle f, g \rangle$
give the following rather obvious result:


If none of the index variables are missing, and for any generated
variable, all occurrences of that variable are identical, then loop (1) can be
rewritten as a

      DO α CONC FOR ALL (I1, ..., In) ∈ $\mathcal{I}$

loop.

The hypothesis means that in the expression (7) for any
generated variable VAR, $r = n$ and the $m^k$ are the same for all occurrences of
VAR.


CHAPTER III


SYNCHRONOUS PROCESSORS


I.  The DO SIM Statement

We now consider the case of completely synchronous processors,
the primary example being the ILLIAC-IV. To accommodate it, let us introduce
the DO SIM (for simultaneously) statement, having the following form:

      DO α SIM FOR ALL (I1, ..., Ik) ∈ $\mathcal{S}$,

where $\mathcal{S}$ is a subset of $Z^k$. Its meaning is similar to that of the
DO CONC statement, except that the computation is performed synchronously
by the individual processors. Each point of $\mathcal{S}$ is assigned to a separate
processor, and each statement in the range of the DO SIM is, in turn,
simultaneously executed by all the processors. An assignment statement
is executed by first computing the right-hand side, then simultaneously
performing the assignment.

As an example, consider

      DO 15 SIM FOR ALL I ∈ { x: 2 ≤ x ≤ 10 }
   14 A(I) = A(I-1) + B(I)
   15 B(I) = A(I) ** 2

The right-hand side of statement 14 is executed for all I before the assignment
of A(I) is made, and before statement 15 is executed. Therefore, if initially
A(4) = 5 and B(5) = 2, then executing the loop sets A(5) = 7 and B(5) = 49.
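
The statement-by-statement semantics of this DO SIM example can be mimicked in Python by evaluating every right-hand side before storing any result. This is a sketch only; the name `do_sim_example` and the use of dictionaries for the arrays are assumptions made for illustration.

```python
def do_sim_example(A, B, points):
    """Execute statements 14 and 15 synchronously over all points."""
    A, B = dict(A), dict(B)

    # Statement 14: compute A(I-1) + B(I) for every I, then assign all A(I).
    rhs = {i: A[i - 1] + B[i] for i in points}
    for i in points:
        A[i] = rhs[i]

    # Statement 15: runs only after statement 14 has finished for every I.
    rhs = {i: A[i] ** 2 for i in points}
    for i in points:
        B[i] = rhs[i]
    return A, B


# Initial values other than A(4) = 5 and B(5) = 2 are arbitrary zeros.
A0 = {i: 0 for i in range(1, 11)}
B0 = {i: 0 for i in range(1, 11)}
A0[4], B0[5] = 5, 2
A1, B1 = do_sim_example(A0, B0, range(2, 11))
print(A1[5], B1[5])
```

An ordinary sequential DO loop would instead use the freshly assigned A(4) when computing A(5), which is exactly the difference the text describes.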

Because of the simultaneity of execution of the body for the
various points of $\mathcal{S}$, we cannot allow any conditional transfer of control in
the loop body which depends upon the index variables. E.g., the statement

      IF (A(I)) 3, 4, 5

may not appear in a "DO SIM FOR ALL I" loop.


For simplicity, assume that there is no transfer of control within
the body of the loop, so the statements are always executed sequentially
in the order in which they appear.

We will allow conditional assignment statements such as

      IF (A(I) .EQ. 0) B(I) = 3.

They are easily implemented on the ILLIAC-IV because of its ability to turn
off individual processors.

The only other restriction to be made on the body of a DO SIM loop
is that a generation may not reference the same array element for two different
points in its index set $\mathcal{S}$. I.e., an assignment statement may not have
two different processors simultaneously storing values into a single memory
location. We do allow them to simultaneously load a value from a single
memory location, so this restriction is not made for uses of a variable.*


* Simultaneous loads from a single memory location are implemented in the
ILLIAC-IV by the ability of the central control unit to broadcast a value to all
processors.


II.  Rewriting the Loop


Now consider the problem of rewriting the given loop (1) in the
form

    (12)     DO α J1 = λ1, μ1
               .
               .
             DO α Jk = λk, μk
             DO α SIM FOR ALL (J(k+1), ..., Jn) ∈ $\mathcal{I}_{J^1, \ldots, J^k}$
               loop body
        α    CONTINUE

This is the same as loop (4), except the DO CONC is replaced by a DO SIM.
We assume that the body of loop (1) contains no control transferring statements,
i.e., no GO TO or numerical IF statements.

Define the mappings $J$ and $\pi$ as before. Any DO CONC statement
can be executed as a DO SIM, since it must give the same result if the
asynchronous processors happen to be synchronized. Thus, the rewriting
could be done just as before by finding a $\pi$ which satisfies C2. However,
the synchrony of the computation will allow us to weaken the condition C2.

Recall that rule C1 was made so that the rewriting will preserve
the order in which two different references are made to the same array
element. For references made during two different executions of the loop body,
the asynchrony of the processors requires that the order of those executions
be preserved. However, with synchronous processors, we can allow the
two loop body executions to be done simultaneously if the references will then
be made in the correct order. The order of these two references is determined
by the positions within the loop body of the occurrences which do the
referencing.

For two occurrences f and g, let f « g denote that the execution
of f precedes the execution of g within the loop body. This means that either
the statement containing f precedes the statement containing g, or else that
f is a use and g a generation in the same statement. The above observation
allows us to change rule C1 to the following weaker condition on $\pi$:

    For ... generation: if $T_f(P) = T_g(Q)$ for
    $P, Q \in \mathcal{I}$ with $P < Q$, then we must have either
      (i)   $\pi(P) < \pi(Q)$, or
      (ii)  $\pi(P) = \pi(Q)$ and f « g.

In this rule, either (i) or (ii) is sufficient to insure that occurrence f references
$T_f(P)$ for the point $P \in \mathcal{I}$ before g references the same element for $Q \in \mathcal{I}$.

The conditions can be rewritten in the following equivalent form:
(i) $\pi(P) \leq \pi(Q)$ and (ii) if $\pi(P) = \pi(Q)$ then f « g.

In the same way C2 was obtained from C1, the above rule gives
the following:

    (S1)  For every variable and every ordered pair of
          occurrences f, g of that variable, at least one
          of which is a generation: for every
          $X \in \langle f, g \rangle$ with $X > 0$, we must have
            (i)   $\pi(X) \geq 0$, and
            (ii)  if $\pi(X) = 0$ then f « g.

If $\pi$ satisfies rule S1, then it satisfies the preceding rule, so the rewritten
loop (12) is equivalent to the original loop (1).


We have been assuming that in rewriting the loop body, the order
of execution of the occurrences was not changed. I.e., the $I^i$ were replaced
by expressions involving $J^1, \ldots, J^n$, but nothing else was done to the loop
body. Now let us consider changing the order of execution of the occurrences.*
That is, we may change the position of occurrences within the loop
body. For example, we may reorder the statements.


Let f « g mean that f is executed before g in the rewritten loop
body (12). Then rule S1 guarantees that the correct temporal ordering of
references is maintained when the references were made in the original loop
during different executions of the loop body. Having changed the positions
of occurrences in rewriting the loop body, we now have to make sure that
any two references to the same array element made during a single execution
of the loop body are still made in the correct order. The following analogue
of rule C1 handles this:

    For ... generation: if $T_f(P) = T_g(P)$ for some $P \in \mathcal{I}$,
    and f precedes g in the original loop body,
    then f « g.


* Remember that there was no point in doing this before, since it couldn't
help for asynchronous processors.

Rewriting this in terms of the sets $\langle f, g \rangle$ gives the following
rule.

    (S2)  For every variable, and every ordered pair of
          occurrences f, g of that variable, at least one of
          which is a generation: if $0 \in \langle f, g \rangle$ and f precedes g in
          the original loop body, then f « g.


Rules S1 and S2 guarantee that the rewritten loop (12) is equivalent to the
original loop (1). Note that rule S2 does not involve $\pi$.


III.  The Coordinate Method


We could now try to solve the following problem: find a rewriting
of the loop body (and the resulting « relation between occurrences) and a
mapping $\pi$ which satisfy rules S1 and S2, and which minimize the expression
(11). This would give a rewriting of the loop which is optimal in the sense
that the outer DO $J^1$/.../DO $J^k$ loop has the fewest iterations. However,
the optimality of such a rewriting is illusory, for reasons which we will
now discuss.


The ILLIAC-IV has 64 processors. The feasibility of a machine
with so many processors is achieved by having all processors operate
synchronously with a single control unit, and by allowing each processor to
access only its own separate portion of memory. If processor 12 wants to
load a data word contained in processor 5's part of memory, then the following
sequence of instructions is executed simultaneously by each processor
number i, for i = 0 to 63:

    (1)  load

    (2)  transmit data word to processor i + 7 (mod 64).
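
The effect of this two-instruction sequence can be pictured with a small sketch in which every processor sends its word the same distance around a ring. The function name `route` and the list representation of the 64 processor memories are assumptions for illustration, not a model of the actual ILLIAC-IV routing hardware.

```python
def route(words, distance, n=64):
    """Every processor i transmits its loaded word to processor (i + distance) mod n."""
    received = [None] * n
    for i in range(n):
        received[(i + distance) % n] = words[i]
    return received


# words[i] stands for the data word loaded by processor i.
words = [100 + i for i in range(64)]
received = route(words, 7)
print(received[12] == words[5])
```

Because all processors must route by the same distance, a fetch pattern in which different processors need different distances cannot be served by a single step of this kind, which is why storage layout matters.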

This means that the method of storing arrays must depend upon
how they are to be accessed. For example, consider the occurrence
F(J1 - 2 * J2 - J3, J3) inside the DO CONC FOR ALL (J2, J3), which we
generated before with the Hyperplane Theorem. It necessitates a complicated,
space-wasting format for storing the array F. The array would probably
have to be reformatted before and after execution of the outer DO $J^1$ loop.*

It appears that the best results are obtained by choosing a
mapping $J$ which gives a loop with simple subscripting and a reasonable
amount of simultaneous computation. An obvious way of choosing such a $J$
is to let $J^1, \ldots, J^n$ be a permutation of the original index variables $I^1, \ldots, I^n$.
More precisely, the mapping $\pi: Z^n \to Z^k$ is taken to be a coordinate projection,
that is, a mapping for which $\pi[(a^1, \ldots, a^n)]$ is obtained by deleting $n - k$
coordinates from $(a^1, \ldots, a^n)$.

For example, for $n = 5$ we may want to rewrite loop (1) as

      DO α I3 = l3, u3
      DO α I4 = l4, u4
      DO α SIM FOR ALL (I1, I2, I5) ∈ { (x, y, z):
     X    l1 ≤ x ≤ u1, l2 ≤ y ≤ u2, and l5 ≤ z ≤ u5 }

Then $\pi[(I^1, I^2, I^3, I^4, I^5)] = (I^3, I^4)$ and $J[(I^1, \ldots, I^5)] = (I^3, I^4, I^1, I^2, I^5)$.
Notice that if $\pi$ is a coordinate projection, then the sets $\mathcal{I}_{J^1, \ldots, J^k}$ of loop (12)
are easy to compute.

The coordinate method consists of first choosing a coordinate
projection $\pi$, and then trying to find a rewriting of the loop body for which
S1 and S2 are satisfied. Since rewriting the loop body makes no difference
to condition (i) of S1, we must first require that it be satisfied for all
relevant occurrences f, g. Next, we apply S1 and S2 to get certain ordering
relations « between occurrences. We must then decide if it is possible
to rewrite the loop body so that these relations are satisfied.


* A precise statement of the rules relating storage allocation and DO SIMs is
contained in [4].

In order to make this decision, we need a trivial observation:
a use in an assignment statement must precede the generation in that
statement. This observation will be given the status of a rule.

    (S3)  For any use f and generation g
          in a single statement, we must have
          f « g.

Now we add the relations « given by S3 to those obtained from
S1 and S2. Next, we add all relations implied by transitivity. I.e., whenever
f « g and g « h, we must add the relation f « h.* If the resulting
ordering relations are consistent, that is, if we do not have f « f for any
occurrence f, then the loop body can be rewritten to satisfy the ordering
relations. We will describe the method of rewriting the loop body by an
example.
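
Adding the relations implied by transitivity and then testing for consistency can be sketched as below. This is a deliberately naive illustration, not the efficient algorithm cited in the footnote: the function name `close_and_check` is hypothetical, relations are represented as pairs (f, g) meaning f « g, and the occurrence names in the usage lines are made up.

```python
def close_and_check(relations):
    """Return (closure, consistent) for a set of ordering relations f « g.

    closure    : the relations together with everything implied by transitivity
    consistent : False exactly when some relation f « f appears in the closure
    """
    closure = set(relations)
    changed = True
    while changed:                      # repeat until no new relation appears
        changed = False
        for f, g in list(closure):
            for g2, h in list(closure):
                if g == g2 and (f, h) not in closure:
                    closure.add((f, h))
                    changed = True
    consistent = all(f != g for f, g in closure)
    return closure, consistent


# Hypothetical occurrences p, q, r.
closure, consistent = close_and_check({("p", "q"), ("q", "r")})
print(sorted(closure), consistent)
```

When the closure is consistent, any ordering of the occurrences compatible with it (a topological order) is a candidate for the rewritten loop body.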


* An efficient algorithm for doing this is given by [5].

Consider the following simple loop:

      DO 24 I1 = 2, 50
      DO 24 I2 = 1, 5
   21 A(I1, I2) = B(I1, I2) + C(I1)
        a1          b1         c1
   22 C(I1) = B(I1-1, I2)
        c2       b2
   23 B(I1, I2) = A(I1+1, I2) ** 2
        b3          a2
   24 CONTINUE

We want to rewrite it as a DO $I^2$/DO SIM FOR ALL $I^1$ loop, so we apply the
coordinate method with the coordinate projection $\pi$ defined by $\pi[(I^1, I^2)] = I^2$.
We proceed as follows. (The calculations for steps 1-3 are shown in Table 2.)

    1.  Compute the relevant sets <f,g> for rules
        S1 and S2.

    2.  Check that S1(i) is not violated.

    3.  Find the ordering relations given by S1(ii) and S2.

    4.  Apply S3 to get the following relations:

            statement 21:  b1 « a1
                           c1 « a1
            statement 22:  b2 « c2
            statement 23:  a2 « b3


                                 Table 2

      The Sets              Is S1(i)       Ordering Relations
                            Violated?      S1(ii)       S2

      <a1,a1> = (0,0)       NO             -            -
      <a1,a2> = (-1,0)      NO             -            -
      <a2,a1> = (1,0)       NO             a2 « a1      -
      <b3,b3> = (0,0)       NO             -            -
      <b1,b3> = (0,0)       NO             -            b1 « b3
      <b3,b1> = (0,0)       NO             -            -
      <b2,b3> = (-1,0)      NO             -            -
      <b3,b2> = (1,0)       NO             b3 « b2      -
      <c2,c2> = (0,*)       NO             -            -
      <c1,c2> = (0,*)       NO             -            c1 « c2
      <c2,c1> = (0,*)       NO             -            -

5.  Find  all  relations  Implied  by  trcnsitivity: 

b3  «  c2  I  by  b3  «  b2  and  b2  «  c2] 

a2  «  b2  [by  a2  «  b3  and  b3  «b2] 

bl  «  b2  [  by  bl  «  b3  a  nd  b3  «  b2] 

bl  «  c2  [  by  bl  «  b2  and  b2  «  c2] 

a2  «  c2  [  by  a  2  «  b3  and  b3  «  c2] 

6. Check that no relation of the form f « f was found in
   3 or 5.

7. Order the generations in any way which is consistent
   with the above relations - i.e., obeying b3 « c2.
   We let a1 « b3 « c2.

We then write

 21   A(I1, I2) =
        (a1)
 23   B(I1, I2) =
        (b3)
 22   C(I1) =
       (c2)

8. Insert the uses in positions implied by the ordering
   relations (recall that a2 « a1):

                  A(I1+1, I2)
                     (a2)
 21   A(I1, I2) = B(I1, I2) + C(I1)
        (a1)        (b1)       (c1)
 23   B(I1, I2) = ... ** 2
        (b3)
 22   C(I1) = B(I1-1, I2)
       (c2)     (b2)

9. Add any extra variables necessitated by uses no
   longer appearing in their original statements:

      X(I1+1, I2) = A(I1+1, I2)
                      (a2)
 21   A(I1, I2) = B(I1, I2) + C(I1)
        (a1)        (b1)       (c1)
 23   B(I1, I2) = X(I1+1, I2) ** 2
        (b3)
 22   C(I1) = B(I1-1, I2)
       (c2)     (b2)

This finally gives us the following rewriting of our original loop:

      DO 24 I2 = 1, 5
      DO 24 SIM FOR ALL I1 ∈ {x: 2 ≤ x ≤ 50}
      X(I1+1, I2) = A(I1+1, I2)
 21   A(I1, I2) = B(I1, I2) + C(I1)
 23   B(I1, I2) = X(I1+1, I2) ** 2
 22   C(I1) = B(I1-1, I2)
 24   CONTINUE
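
The rewriting can be checked mechanically. The following Python sketch is not part of the original report: it simulates the original loop and the rewritten loop on small integer data and compares the final contents of A, B and C. The initial data, the array sizes, and the reading of DO SIM as "each statement is carried out for every I1 before the next statement starts" are assumptions made for illustration.

```python
def make_data():
    # Hypothetical initial data; row/column 0 is unused so that the
    # subscripts line up with the FORTRAN text.
    A = [[i + 2 * j for j in range(6)] for i in range(52)]
    B = [[3 * i - j for j in range(6)] for i in range(52)]
    C = [i % 7 for i in range(52)]
    return A, B, C


def run_original(A, B, C):
    # DO 24 I1 = 2, 50 / DO 24 I2 = 1, 5, executed sequentially.
    for i1 in range(2, 51):
        for i2 in range(1, 6):
            A[i1][i2] = B[i1][i2] + C[i1]        # statement 21
            C[i1] = B[i1 - 1][i2]                # statement 22
            B[i1][i2] = A[i1 + 1][i2] ** 2       # statement 23
    return A, B, C


def run_rewritten(A, B, C):
    # DO 24 I2 = 1, 5 / DO 24 SIM FOR ALL I1 in 2..50.
    X = {}
    sim = range(2, 51)
    for i2 in range(1, 6):
        # Each statement: evaluate every right-hand side, then store.
        vals = [A[i1 + 1][i2] for i1 in sim]     # X(I1+1,I2) = A(I1+1,I2)
        for i1, v in zip(sim, vals):
            X[i1 + 1, i2] = v
        vals = [B[i1][i2] + C[i1] for i1 in sim]  # statement 21
        for i1, v in zip(sim, vals):
            A[i1][i2] = v
        vals = [X[i1 + 1, i2] ** 2 for i1 in sim]  # statement 23
        for i1, v in zip(sim, vals):
            B[i1][i2] = v
        vals = [B[i1 - 1][i2] for i1 in sim]     # statement 22
        for i1, v in zip(sim, vals):
            C[i1] = v
    return A, B, C


same = run_original(*make_data()) == run_rewritten(*make_data())
```

On this data the two versions agree. That does not prove the rewriting correct in general, but it is a cheap sanity check on the statement order chosen in step 7.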


V.  Further  Remarks 


It is easy to deduce a general algorithm for the coordinate
method from the preceding example. The method can be extended to
cover the case of an inconsistent ordering of the occurrences. In that
case, the loop can be broken into a sequence of sub-loops. Every generation
g for which the relation g « g does not hold can be executed within a
DO SIM loop. An algorithm for doing this is described in Chapter 4.

Observe that there are only 2^n - 1 choices of a coordinate
projection π for rewriting loop (1). It is easy to try them all, in decreasing
order of the amount of parallel computation achieved, until one is found
for which the rewriting is possible. Rule S1 should rapidly eliminate many
choices.
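
The search over coordinate projections can be sketched as follows. This is an illustration only, not the report's algorithm: a projection is represented by the set of index positions that become DO SIM variables, the "amount of parallel computation" is approximated by the product of the trip counts of those variables, and `is_rewritable` is a hypothetical stand-in for the checks of rules S1 to S3.

```python
from itertools import combinations
from math import prod


def candidate_sim_sets(trip_counts):
    """All non-empty sets of index positions, most parallel first."""
    n = len(trip_counts)
    subsets = [frozenset(c)
               for r in range(1, n + 1)
               for c in combinations(range(n), r)]
    # Assumption: parallelism is measured by the product of trip counts.
    return sorted(subsets,
                  key=lambda s: (-prod(trip_counts[j] for j in s), sorted(s)))


def first_feasible(trip_counts, is_rewritable):
    """Return the first candidate accepted by the (hypothetical) checker."""
    for sim in candidate_sim_sets(trip_counts):
        if is_rewritable(sim):
            return sim
    return None


# Two index variables: position 0 is I1 (49 trips), position 1 is I2 (5 trips).
candidates = candidate_sim_sets([49, 5])
chosen = first_feasible([49, 5], lambda s: s == frozenset({0}))
```

For n = 2 there are 2^2 - 1 = 3 candidates. With a checker that accepts only DO SIM over I1, the search settles on the projection used in the worked example.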

It may happen that the rewriting cannot be done with any
coordinate projection. In this case, a more general linear mapping π must
be sought, using the approach developed before for DO CONC loops. For
example, no coordinate projection works for the relaxation loop (10).


CHAPTER  IV 


PRACTICAL  CONSIDERATIONS 


I. Restrictions on the Loop


Now consider the application of these methods to the problem
of compiling a FORTRAN program for execution on a multiprocessor computer.
We immediately observe that the restrictions which have been placed on the
loop (1) would eliminate most real FORTRAN DO loops from consideration. For
example, the DO limits l^j, u^j are usually not all constants known at compile
time. Fortunately, most of the restrictions were made to simplify the
exposition, and are not essential. We will now describe the restrictions which
are essential to the analysis.

First, some terms must be defined. By a "loop constant", we
mean an expression whose value does not change while the loop is executed -
i.e., any expression not involving generated variables or loop index variables.
A quantity is "known at compile time" if it has a constant value which can
be determined by the FORTRAN compiler.

The  analysis  can  be  applied  to  the  following  loop: 

(13)      DO α I^1 = l^1, u^1, d^1
             ...
          DO α I^n = l^n, u^n, d^n

             loop body

     α    CONTINUE

assuming  that  it  satisfies  the  following  conditions: 



1. Each d^j is known at compile time.

2. The loop body contains no transfer of control to any
   statements outside it.

3. There is no I/O statement in the loop body.

4. For each subroutine or function call in the loop body, it is
   known which variable elements it can modify.

5. Each occurrence of a generated variable must be of the form

      VAR(e^1, ..., e^m), with

      e^i = a^i_1 * I^1 + ... + a^i_n * I^n + c^i,

   where c^i is a loop constant and each a^i_j is known at compile
   time.

For the coordinate method, the following additional assumptions are required.

6. There is no transfer of control within the loop body.

7. For every generated variable VAR, each occurrence of VAR
   within the loop body must be of the form

      VAR(a^1 * I^(j_1) + c^1, ..., a^m * I^(j_m) + c^m),

   where c^i is a loop constant, a^i = +1 or 0, and the a^i and j_i are
   the same for all occurrences of VAR.

By  weakening  the  restrictions,  many  complicated  details  are 
added  to  the  process  of  rewriting  the  loop.  However,  the  analysis  remains 
largely  unchanged.  Some  of  these  details  are  described  in  Chapter  4. 

A significant change is introduced by allowing occurrence mappings
of the form given in 5. It necessitates a complicated restating of the
Hyperplane and Plane Concurrency Theorems, as well as changing the method
of choosing the mapping π. This will be discussed in a future paper.

Note that if the loop (13) satisfies these restrictions, then so
does the inner DO I^k / ... / DO I^n loop, for any k. (The index variables
I^j for j < k are loop constants for the inner loop.)
II. Meeting the Restrictions


Even if a given loop does not satisfy the above restrictions, it
may be possible to rewrite it so that it does. We will give some useful
techniques for doing this.

It is easy to fulfill the requirement that the DO statements be
tightly nested. The method is illustrated by the following example. The loop

      DO 77 I1 = 1, 10
 16   A(I1, 1) = 0
      DO 77 I2 = 2, 20

can be rewritten as the following tightly nested loop:

      DO 77 I1 = 1, 10
      DO 77 I2 = 2, 20
 16   IF (I2 .EQ. 2) A(I1, I2-1) = 0

This technique is referred to as quantifying statement 16. It may be possible
to move the statement back outside the DO I2 loop and unquantify it after the
rewriting is performed.

Occurrence mappings can sometimes be rewritten by substituting
for generated variables so that condition 5 is met. One trick is illustrated
by the following example. Given

      K = N
      DO 6 I = 1, N
 5    B(I) = A(K)
 6    K = K - 1

we can rewrite it as

      DO 51 I = 1, N
 5    B(I) = A(N + 1 - I)
 51   CONTINUE
 61   K = 0

This use of auxiliary variables to effect negative incrementing is fairly
common in FORTRAN programs.
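
The substitution of N + 1 - I for the auxiliary variable K can be illustrated with a small Python sketch. It is an illustration only; the function names are hypothetical, N ≥ 1 is assumed, and the lists are padded so that index 0 is unused and the subscripts match the FORTRAN text.

```python
def with_auxiliary_variable(a, n):
    # K = N; DO 6 I = 1, N; B(I) = A(K); K = K - 1
    b = [None] * (n + 1)
    k = n
    for i in range(1, n + 1):
        b[i] = a[k]
        k = k - 1
    return b, k


def with_substituted_subscript(a, n):
    # B(I) = A(N + 1 - I): the subscript now has the form required by
    # condition 5, and the loop body no longer generates the scalar K.
    b = [None] * (n + 1)
    for i in range(1, n + 1):
        b[i] = a[n + 1 - i]
    k = 0  # after N decrements K has the value N - N
    return b, k


a = [None, 10, 20, 30, 40]
before = with_auxiliary_variable(a, 4)
after = with_substituted_subscript(a, 4)
```

Both versions fill B with the elements of A in reverse order and leave the same value in K.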

In condition 6, the real restriction is that there can be no
possible loops inside the loop body. If this is the case, then transfer of
control can easily be eliminated by quantifying assignment statements with
logical IFs.


III. Scalar Variables

Even  though  the  loop  satisfies  all  the  restrictions,  it  is  clear 
that  these  methods  can  give  no  parallel  computation  if  there  are  generated 
scalar  variables.  Any  such  variable  must  be  eliminated. 

A common situation is for the variable to be just a temporary
storage word within a single execution of the loop body. The variable X in
the following loop is an example:

      DO 3 I = 1, 10
      X = SQRT(A(I))
      B(I) = X
 3    C(I) = EXP(X)

In this loop, each occurrence of X can be replaced by XX(I), where XX is a
new variable.

In general, we want to replace each occurrence of the scalar
by VAR(I^1, ..., I^n), for a new variable VAR.* A simple analysis of the loop
body's flow path determines if this is possible.
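
The replacement of a temporary scalar by a new array can be sketched in Python as follows. This is an illustration under assumptions, not the compiler's procedure: the function names are hypothetical, 0-based lists are used, and each list comprehension stands for one statement executed for all values of I.

```python
import math


def with_scalar(a):
    # X is generated on every pass, so the passes cannot run together.
    n = len(a)
    b, c = [0.0] * n, [0.0] * n
    for i in range(n):
        x = math.sqrt(a[i])
        b[i] = x
        c[i] = math.exp(x)
    return b, c


def with_expanded_scalar(a):
    # X has been replaced by XX(I); no scalar is generated in the loop.
    n = len(a)
    xx = [math.sqrt(a[i]) for i in range(n)]   # XX(I) = SQRT(A(I))
    b = [xx[i] for i in range(n)]              # B(I) = XX(I)
    c = [math.exp(xx[i]) for i in range(n)]    # C(I) = EXP(XX(I))
    return b, c


data = [0.0, 1.0, 4.0, 9.0, 16.0]
unchanged = with_scalar(data) == with_expanded_scalar(data)
```

The two versions perform the same arithmetic on the same operands, so the results are identical.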

Another common situation is when the variable X appears in the
loop body only in the statement

      X = X + expression,

where the expression does not involve X. This statement just forms the sum
of the expression for all points in the index set 𝓘. We can replace it by
the statement

      VAR(I^1, ..., I^n) = expression,

and add the following "statement" after the loop:

      X = X + Σ VAR(I^1, ..., I^n),   summed over all (I^1, ..., I^n) in 𝓘.

The sum can be executed in parallel with a special subroutine.

The same approach applies when the variable is used in a
similar way to compute the maximum or minimum value of an expression
for all points in 𝓘.

* After the rewriting, to save space we can lower the dimension of VAR by
eliminating any subscript not containing a DO FOR ALL index variable.
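
The treatment of a summing variable can be sketched in Python. The sketch is an assumption-laden illustration: the "special subroutine" is modelled by a hypothetical pairwise summation `tree_sum`, and the expression is taken to be the square of the loop index. Integer data is used because a floating-point sum may round differently when its terms are regrouped.

```python
def sequential_sum(x, values):
    # Original loop: X = X + expression on every pass.
    for v in values:
        x = x + v * v
    return x


def tree_sum(var):
    """Pairwise summation; every pass adds all adjacent pairs at once."""
    var = list(var)
    while len(var) > 1:
        if len(var) % 2:
            var.append(0)          # pad so that every element has a partner
        var = [var[i] + var[i + 1] for i in range(0, len(var), 2)]
    return var[0] if var else 0


def rewritten_sum(x, values):
    var = [v * v for v in values]  # VAR(I) = expression, for all I
    return x + tree_sum(var)       # the "statement" added after the loop


total_seq = sequential_sum(5, range(1, 11))
total_par = rewritten_sum(5, range(1, 11))
```

Replacing the addition in `tree_sum` by `max` or `min` (with a suitable padding value) gives the corresponding treatment of maximum and minimum computations.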


IV.  Conclusion 


We  have  presented  methods  for  obtaining  parallel  execution  of 
a  given  DO  loop.  Many  details  and  refinements  were  omitted  for  simplicity, 
but all the basic ideas have been included. Some of the methods are being
implemented in the ILLIAC-IV FORTRAN compiler, as described in Chapter 4.
Preliminary  study  indicates  that  they  will  yield  parallel  execution  for  a 
fairly  large  class  of  programs.  This  is  true  for  other  types  of  multiprocessor 
computers  as  well. 
1. Introduction


1.1  Review 

The previous semi-annual report described, in a somewhat rigorous
fashion, techniques for the detection of parallelism in FORTRAN DO loops.
It was indicated in that report that this procedure can be divided into three
steps:


a)  Rewriting  part  of  the  program  as  a  sequence  of  DO  loops  in  a  form 
suitable  for  analysis  (Setup  step) . 

b)  Detecting  computations  in  the  loops  which  can  be  efficiently 
performed  in  parallel  by  ILLIAC  (Analysis  step) . 

c) Rewriting the loops using the Extended FORTRAN DO FOR ALL statement
to explicitly describe the parallelism (Synthesis step).

The Setup, Analysis, and Synthesis steps are not disjoint; i.e., there is
some overlap among them.

Two  methods  for  accomplishing  the  above,  i.e. , 

a)  Coordinate  Method 

b)  Hyperplane  Method 

were  described  in  the  previous  semi-annual  report. 

1.2  Summary 

Discussions of the Parallel Analyzer contained in this report will be
oriented toward implementation activities accomplished during the past
half year and those planned for the forthcoming months.
A preliminary implementation of the basic Parallel Analyzer components
has been completed and debugged. This is not the final Parallel Analyzer,
since it is missing many parts. The most important omission is a Synthesis
Step (see 1.1). In this version, algorithms are oriented toward the Coordinate
Method, with minimal attention given to the Hyperplane Method.

Algorithms included in the preliminary system relate to:

a) The locating, tight nesting, and rewriting of DO nests.

b) The collecting and validating of all array references and the
permutation of the subscripts.

c) The computing of upper and lower bounds for DO variables when
limits are not known at compile time.

d) The computing of data dependency relations between array uses
and generations (i.e., the <f,g> sets described in the last semi-annual
report).

e) The collecting of all candidate DO SIM variable sets and the
elimination of sets that are obviously bad [...].

f) The backing up from the current statement macro position to reduce
the number of DO variables in a nest, thus perhaps eliminating violations
which occurred with a larger DO variable set.

g) General utility, including routines for character string manipulation,
Boolean matrix manipulation, expression manipulation, intermediate
language generation, and debugging assistance.

Section 2 describes the present environment and includes a discussion
of (g) above. Section 3 describes presently implemented algorithms for the
Setup Step, including (a) and (f) above. Section 4 describes presently
implemented algorithms for the Analysis Step, including (b), (c), (d), (e)
above. Section 5 describes projected implementation work for the months
ahead. The next version of the Parallel Analyzer is scheduled for April 30.
2. Environment


The preliminary Parallel Analyzer and its environmental routines
function as [...]. [...] the input to the Parallel Analyzer
will be the output of the Parser.


2.1 Generating the Input

[...] the present environment [...] extensive facilities for describing
[...] (the crux of Parallel Analysis) [...].

2.2 Debugging Aids

The present system contains facilities for obtaining [...].


2.3 Manipulation of Linear Expressions

A collection of utility routines for the manipulation of expressions of
the form

      c_0 + Σ_{i=1}^{n} c_i x_i,

where c_0 and the c_i are integers and the x_i are scalars, is included.


Such an expression can be:

a) substituted for any variable in any other expression of the same
form,

b) compared for equality with other expressions of the same form,

c) added (subtracted) to (from) another expression of the same form,
and

d) multiplied by an integer constant.


These linear manipulation functions are required in the computation of
<f,g> sets and upper bounds and are utilized by the re-writing and
re-formatting algorithms of both the Basic Coordinate and Hyperplane Methods.
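
The four operations listed above can be sketched with a small Python class. The class name `LinExpr`, its methods, and the dictionary representation are hypothetical and chosen for illustration; the report does not describe how its utility routines store expressions.

```python
class LinExpr:
    """c0 + sum of c_i * x_i, with integer c0 and c_i (hypothetical layout)."""

    def __init__(self, const=0, coeffs=None):
        self.const = const
        # Drop zero coefficients so that equal expressions compare equal.
        self.coeffs = {v: c for v, c in (coeffs or {}).items() if c != 0}

    def __eq__(self, other):                      # operation (b)
        return self.const == other.const and self.coeffs == other.coeffs

    def __add__(self, other):                     # operation (c), addition
        merged = dict(self.coeffs)
        for v, c in other.coeffs.items():
            merged[v] = merged.get(v, 0) + c
        return LinExpr(self.const + other.const, merged)

    def scale(self, k):                           # operation (d)
        return LinExpr(k * self.const,
                       {v: k * c for v, c in self.coeffs.items()})

    def __sub__(self, other):                     # operation (c), subtraction
        return self + other.scale(-1)

    def substitute(self, var, expr):              # operation (a)
        c = self.coeffs.get(var, 0)
        rest = LinExpr(self.const,
                       {v: k for v, k in self.coeffs.items() if v != var})
        return rest + expr.scale(c)


# Subscript K with K replaced by N + 1 - I, then the difference of the
# subscripts I + 1 and I.
k = LinExpr(0, {"K": 1})
reversed_index = k.substitute("K", LinExpr(1, {"N": 1, "I": -1}))
offset = LinExpr(1, {"I": 1}) - LinExpr(0, {"I": 1})
```

Differences of subscripts such as `offset` are the kind of quantity that enters the <f,g> sets; here the difference reduces to the constant 1.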


2.4 Simple Intermediate Language Manipulation

Routines for deleting, duplicating, and creating symbols, constants,
labels, expression level macros, statement level macros, and arbitrary
operands are included. These basic functions are required for any inspection
and alteration of computation trees.


2.5 Boolean Matrix Manipulation

A package for the manipulation of n x n bit matrices is included. In
addition to the obvious simple functions, routines for computing the
transpose and computing the transitive closure are included. Routines of this
sort are required to record program flow, detect cycles, etc.
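
The transpose and transitive closure operations can be sketched in Python as follows. The sketch uses Warshall's algorithm and lists of booleans; the report does not say which closure algorithm or storage layout the package uses, so both are assumptions.

```python
def transpose(m):
    """Transpose of an n x n matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def transitive_closure(m):
    """Warshall's algorithm: q[i][j] is True if j is reachable from i."""
    n = len(m)
    q = [row[:] for row in m]
    for k in range(n):
        for i in range(n):
            if q[i][k]:
                for j in range(n):
                    if q[k][j]:
                        q[i][j] = True
    return q


def has_cycle(m):
    """A cycle exists exactly when some element reaches itself."""
    q = transitive_closure(m)
    return any(q[i][i] for i in range(len(q)))


T, F = True, False
flow = [[F, T, F],
        [F, F, T],
        [F, F, F]]               # 0 -> 1 -> 2
closed = transitive_closure(flow)
looped = [row[:] for row in flow]
looped[2][0] = T                 # add 2 -> 0, closing a cycle
```

A diagonal entry set in the closure signals a cycle, which is how a relation of the form f « f would show up in a precedence matrix.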

2.6 String Handling Package

Because of the lack of anything resembling a proper handling of
characters within the FORTRAN language, a standard string manipulation package
is included. Functions of the package include: scanning, decimal and octal
conversion, concatenation, insertion, extraction, comparison, and substitution.
This package contains the tools necessary for intelligent interaction with the
user and for the generation of reports. These are functions the Parallel
Analyzer is expected to ultimately perform.


3. Setup Step

This section discusses the implemented algorithms dealing
with the 'Setup Step' of the overall Parallel Detection process. [...]
DO body statements are processed. Scanning continues
until the level of DO's reaches zero (i.e., the number of DO labels equals
the number of DO statements). This procedure suggests that as many DO
variables as possible are attached to the nest. However, if subsequent
analysis fails, backup procedures to reduce the number of DO variables are
then [...].

The present implementation [...] any statement in the DO body [...];
the present implementation permits no:

a) control statements (e.g., GOTO, Arithmetic IF),

b) subroutine CALL statements, nor

c) Input/output statements.

Furthermore, function calls are outlawed. These restrictions will, of
course, be relaxed in subsequent implementation versions.

Tight nesting rearranges the nest so that all non-DO statements
follow the innermost DO operator and precede
the first DO label. This procedure is best described by an example.

Let

      DO 10 I = 3, NI
      A(I) = 2
      DO 20 J = 1, NJ
 20   B(I,J) = C(I,J)
 10   IF (D(I)) E(I) = F(I)

be the original statements. The tight nesting algorithms of the Setup Step
would make the above statements appear as:

      DO 30 I = 3, NI
      DO 30 J = 1, NJ
      IF (J .EQ. 1) A(I) = 2
      B(I,J) = C(I,J)
      IF (D(I) .AND. J .EQ. NJ) E(I) = F(I)
 30   CONTINUE
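
The effect of tight nesting on this example can be checked with a short Python sketch. It is an illustration only: arrays are modelled as dictionaries, the function names are hypothetical, and NJ ≥ 1 is assumed so that the guarded statements are executed exactly once for each I.

```python
def original(ni, nj, c, d, f):
    a, b, e = {}, {}, {}
    for i in range(3, ni + 1):          # DO 10 I = 3, NI
        a[i] = 2
        for j in range(1, nj + 1):      # DO 20 J = 1, NJ
            b[i, j] = c[i, j]
        if d[i]:                        # 10 IF (D(I)) E(I) = F(I)
            e[i] = f[i]
    return a, b, e


def tightly_nested(ni, nj, c, d, f):
    a, b, e = {}, {}, {}
    for i in range(3, ni + 1):          # DO 30 I = 3, NI
        for j in range(1, nj + 1):      # DO 30 J = 1, NJ
            if j == 1:                  # guard: first pass of the J loop
                a[i] = 2
            b[i, j] = c[i, j]
            if d[i] and j == nj:        # guard: last pass of the J loop
                e[i] = f[i]
    return a, b, e


ni, nj = 6, 4
c = {(i, j): 10 * i + j for i in range(3, ni + 1) for j in range(1, nj + 1)}
d = {i: i % 2 == 0 for i in range(3, ni + 1)}
f = {i: -i for i in range(3, ni + 1)}
agree = original(ni, nj, c, d, f) == tightly_nested(ni, nj, c, d, f)
```

The statement that preceded the inner DO is guarded by the first value of J, and the statement that followed it is guarded by the last value of J.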

Present Setup algorithms do not look for scalars, sums, etc., nor
completely validate the legality of the tightly nested code for all cases.
This was deliberately omitted, but subsequent implementations will be
extended in this direction.

The DO statements themselves are 'analyzed' during the Setup phase
as follows:

a) The DO variable increment must be a positive or negative integer.

b) The DO variable lower limit or upper limit must be expressible as
follows:

      k + Σ_{i=1}^{m} k_i x_i + Σ_{j=1}^{n} c_j v_j,

where k, k_i, c_j are integers, the x_i are scalars and are not DO variables
of the present DO body, and the v_j are DO variables but not the present
DO variable nor any DO variable interior to it.


If successful, the Setup algorithms create (or partially create) certain
tables, which, together with updated intermediate language information, serve
as input to the Analysis Step.

One of the outputs is an n x n precedence matrix which records the
original flow of the loop body, where n is the number of statements in the
loop body.


4. Analysis Step


This section discusses the algorithms (presently implemented) which
deal with the 'Analysis Step' of the overall Parallel Detection process.

4.1  Background 

The  present  Analysis  (based  on  the  Coordinate  Method)  generates 
candidate  DO  SIM  sets  and  eliminates  obvious  bad  sets.  An  exhaustive 
analysis  of  the  DO  SIM  sets  to  determine  the  'degrees  of  goodness'  of  the 
non-bad  sets  is  made . 

Activities in the Analysis Step (as well as the predecessor Setup step)
occur on a local basis only. Global considerations are not taken into
account in the present implementation. They relate to issues such
as:

a)  arrays  in  COMMON, 

b)  equivalenced  arrays, 

c)  external  subroutines  and  functions, 

d)  formal  parameters  (arrays)  received  and  passed,  and 

e)  the  effect  the  rewriting  of  one  loop  within  a  program  unit  has 
upon  the  parallelization  of  another  loop  within  the  same  program  unit. 

4.2.1  What  Exists 

The Setup Step will have located a DO nest, extracted the DO statements,
tightly nested the DO body and recorded this information in Parallel
Analyzer tables and in upgraded intermediate language tables.

The  first  step  performed  here  is  a  tree  walk  of  each  statement  in  the 
loop  body  which  searches  for  array  references  (generations  and  uses).  A 
generation  is  an  array  reference  to  the  left  of  the  substitution  operation; 
any  other  appearance  is  a  use. 
A table of arrays and array elements is built. Subscripts are permuted
to conform to the order of the DO variables. Rejects occur if references to
the same array have different permutations. Subscript expressions are [...]
they do not violate [...].

The components of each DO statement, i.e., the lower limit, upper
limit, or increment, are then analyzed. If all these quantities are known at
compile time this analysis is simple. If not, upper bounds on the limits
are computed as a function of the

a) non-integer limit expressions, and the

b) dimension information from the symbol table.

A convenient, ordered list of all possible DO SIM sets is then
constructed. Because of previous analysis, some of these sets can at this
point be declared 'bad', i.e., totally unsuited for parallelism.

An algorithm is then invoked to compute data dependency information - in
effect the <f,g> sets described in the last semi-annual report. This
information is recorded for subsequent analysis.

Portions of the basic Coordinate algorithm, operating over the data
(in particular the <f,g> sets), are then invoked. Information is recorded
in tables (in particular the candidate for DO SIM table) declaring a set as
bad, good, or questionable. If good or questionable, data is accumulated for
subsequent analysis and selection.

It should be pointed out that the analysis to date has been directed
towards the Coordinate Method as opposed to the Hyperplane Method.
However, all Hyperplane Method considerations were not eliminated when
implementing the present version.
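
The data dependency computation can be sketched in Python for the simple case in which every subscript is a single DO variable plus a constant. The sketch rests on an assumption about the meaning of <f,g>: it is taken to be the set of index differences X for which occurrence f at point P and occurrence g at point P + X refer to the same array element. This reading is not stated in this section; it was chosen because it reproduces the entries of Table 2 in the worked example. The tuple representation and the function name are hypothetical.

```python
def fg_set(f, g, n):
    """<f,g> for two occurrences of the same array in an n-deep DO nest.

    Each occurrence is a tuple of (index_position, offset) pairs, one pair
    per subscript. A '*' in the result marks a DO variable that does not
    appear in the subscripts, so any difference is allowed there.
    """
    x = ["*"] * n
    for (jf, cf), (jg, cg) in zip(f, g):
        # Assumed restriction: the same DO variable in matching subscripts.
        assert jf == jg
        x[jf] = cf - cg
    return tuple(x)


# Occurrences from the worked example; position 0 is I1, position 1 is I2.
a1 = ((0, 0), (1, 0))    # A(I1, I2)
a2 = ((0, 1), (1, 0))    # A(I1+1, I2)
b2 = ((0, -1), (1, 0))   # B(I1-1, I2)
b3 = ((0, 0), (1, 0))    # B(I1, I2)
c1 = ((0, 0),)           # C(I1)
c2 = ((0, 0),)           # C(I1)
```

With this representation, `fg_set(a1, a2, 2)` gives (-1, 0) and `fg_set(c1, c2, 2)` gives (0, '*'), as in the table.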
5. The Next Version

The next version of the Parallel Analyzer is scheduled for completion
on April 30, 1972. This section discusses what will be appended to the current
Parallel Analyzer to produce this next version.

5.1  General 

Minor  changes  will  be  made  to  the  tight  nesting  and  some  handling  of 
scalars  will  be  done  (see  Setup  extensions  below) . 

Exhaustive analysis to select proper DO SIM variables will be performed.
Data and statistics leading to acceptance or rejection will be studied. (See
Analysis below.) Selected loops will be rewritten using IVTRAN DO FOR ALL
statements.

The Parallel Analyzer will continue to function as a computer program
independent of the rest of the ILLIAC compiler. Output reports will be
improved.

No new implementation work will be performed in the following areas:

a) Hyperplane Method

b) Inconsistent Orderings (cycles within DO body)

c) Transfer Outside Loop

d) Backward Transfer

e) Creation of DO loops from non-explicit DO statements

f) Forward Transfers

g) Subroutine Calls


5.2 Setup Step Extensions

Some enhancements will be made to the tight nesting procedure. This
procedure will be interlaced with new procedures for handling generated
scalars.

When resolving unwanted scalars, certain common special cases will
be looked for first. In particular, is the scalar a summing variable? Arrays
should be introduced to eliminate storing into scalars within the body of the
loop.


5.3  Analysis  Step  Extensions 

The candidate for DO SIM table is passed over and updated in the
following fashion. Sets declared by previous analysis as bad are avoided.
A candidate set, say Π, is selected for examination. An m x m precedence
bit matrix P for Π, based on the data dependency relation of the <f,g> sets
and the original order of the statements, is constructed, where m is the number
of array occurrences.

If any intra-statement dependencies are unveiled from inspection of P,
Π is marked as bad and a new Π is selected. The transitive closure of P
is computed and stored in Q. The number of cycles within Q is counted and
recorded. If there exist any cycles, the inconsistent ordering algorithms
must be invoked if this DO SIM set is ultimately selected as the best. The
inconsistent ordering routines (as pointed out earlier) will not be available
in the next version.

Simple  and  nice  (indeed  perhaps  the  most  common  in  real  programs) 
situations  are  looked  for  and  special  attention  is  given  to  them  in  an  effort 
to  further  exploit  the  parallelism.  Such  situations  include: 

a)  only  one  DO  variable  in  the  DO  SIM  set, 

b)  no  data  dependency  sets,  i.e. ,  no  <f,g>  sets, 

c)  <f,g>  =  null  for  all  sets, 

d)  <f,g>  =  0  for  all  sets,  etc. 
A linear ordering for the statements is constructed. The number of
uses which have to be moved is tallied and recorded. The above computations
are saved in case this Π is the one ultimately chosen. The procedure
then cycles and selects another Π, if any, from the candidate for DO SIM
tables.
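
The construction of a linear ordering from precedence relations can be sketched in Python, using the relations found in steps 3 and 4 of the worked example of the coordinate method. This is an illustration only: the relation list is stored as pairs rather than as the bit matrix P, the function names are hypothetical, and the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later) stands in for whatever ordering routine the Analyzer uses.

```python
from graphlib import TopologicalSorter


def closure(relations):
    """Transitive closure of a collection of (before, after) pairs."""
    rel = set(relations)
    changed = True
    while changed:
        changed = False
        for a, b in list(rel):
            for c, d in list(rel):
                if b == c and (a, d) not in rel:
                    rel.add((a, d))
                    changed = True
    return rel


def order_generations(relations, generations):
    """Order the generations, or return None if some f << f is implied."""
    rel = closure(relations)
    if any(a == b for a, b in rel):
        return None                       # inconsistent ordering (a cycle)
    ts = TopologicalSorter({g: set() for g in generations})
    for a, b in rel:
        if a in generations and b in generations:
            ts.add(b, a)                  # a must be placed before b
    return list(ts.static_order())


# Relations from steps 3 and 4 of the worked example.
relations = [("a2", "a1"), ("b1", "b3"), ("b3", "b2"), ("c1", "c2"),
             ("b1", "a1"), ("c1", "a1"), ("b2", "c2"), ("a2", "b3")]
generations = ["a1", "b3", "c2"]
order = order_generations(relations, generations)
```

Any order returned places b3 before c2, as step 7 of the example requires; the position of a1 is not constrained by the relations, so several orders are acceptable.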

Presented with possibly n different DO SIM sets, algorithms will be
needed to select the 'best' on the basis of previously recorded data. The
data available for inspection at this point should serve as a heuristic in the
creation of the final algorithm. Data available will ultimately concern:

a) cycles and structure,

b) number of moves of array uses,

c) frequency,

d) user declared information ('control card' information,
interactive acquisition),

e) allocation checks, etc.

A DO SIM set, say α, is chosen. If α contains cycles, algorithms
for handling inconsistent orderings should be invoked. If data computed
above (i.e., precedence matrix P, transitive closure, linear ordering) for
α has been dumped, it should be restored at this point in preparation for
what is called the 'Synthesis Step'.

5.4 Synthesis

After selecting a suitable DO SIM set α, the final task is the rewriting
of the loop. The task of rearranging, adding, regenerating, and rewriting
statements has been simplified because of the rich collection of utility
routines (2.3, 2.4) available.


Further analysis may have to be imbedded in the Synthesis algorithms;
for example, some time measure of the number of new instructions added to
the new loop, etc. It would be possible (with proper frequency information)
to predict this in the Analysis proper. A more accurate measure,
however, may be desired.

If [...] occurs, the intermediate language [...] restored to appear as
[...] parallel attempts had [...].

5.5 Conclusion

The first implementation of the Parallel Analyzer has been completed.
This version isolates DO loop ranges and performs early analysis. The
next implementation will continue with further analysis and synthesis
for the Basic Coordinate Method. Activities will remain at a
local level with global considerations deferred.